Python Web Scraping in Practice: Batch-Downloading HD Pictures of Beautiful Women

2021-03-06 Python學習交流樂園

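The 彼岸圖網 site (pic.netbian.com) hosts a large collection of HD images and wallpapers that can be downloaded for free. This article uses a Python crawler to batch-download the site's HD pictures of beautiful women, as a way to practise the basic steps of writing a crawler in Python: send a request, get the response, parse and extract the data, and save it locally. Readers can scrape other categories of images as needed; the method is the same.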

Target URL: http://pic.netbian.com/4kmeinv/index.html

1. Scraping one page of images

Extracting the image data with regular expressions

[Screenshot: part of the page source]

Resetting the response encoding to GBK fixes the garbled text.
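
The garbling can be reproduced without any network access. The sketch below assumes the server sends GBK-encoded bytes and that the client decodes them as ISO-8859-1, which is the fallback requests uses for text responses whose Content-Type header carries no charset. The sample string is made up for illustration.

```python
# Hypothetical sample text standing in for part of the page
original = "高清壁纸"
raw = original.encode("gbk")  # the bytes a GBK-encoded page would contain

# Decoding with the wrong codec (ISO-8859-1 here) produces garbled text
garbled = raw.decode("iso-8859-1")
# Decoding with GBK recovers the original text
fixed = raw.decode("gbk")

print(garbled == original)  # False
print(fixed == original)    # True
```

Setting response.encoding = 'GBK' before reading response.text has the same effect as the second decode above.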

Implementation:

import os
import re

import requests

# Directory where the images are saved (created if it does not exist)
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
# Target URL
url = "http://pic.netbian.com/4kmeinv/index.html"
# Browser-like request headers, to reduce the chance of being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}

# Send the request and get the response
response = requests.get(url, headers=headers)
# Printing the page source shows garbled text; resetting the encoding to GBK
# makes the content display correctly so the data can be extracted
response.encoding = 'GBK'

# Use a regular expression to extract each image's link and name
img_info = re.findall('img src="(.*?)" alt="(.*?)" /', response.text)

for src, name in img_info:
    # src is a relative path; prepend the site root to get the real image URL
    img_url = 'http://pic.netbian.com' + src
    img_content = requests.get(img_url, headers=headers).content
    img_name = name + '.jpg'
    with open(os.path.join(path, img_name), 'wb') as f:  # save the image locally
        print(f"Downloading image: {img_name}")
        f.write(img_content)
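
To see what the regular expression returns, it can be run against a small HTML fragment. The fragment below is hypothetical, written only to resemble the kind of img tag the pattern targets; the real page markup may differ.

```python
import re

# Hypothetical fragment modelled on the listing page's <img> tags
sample_html = (
    '<li><a href="/tupian/1.html">'
    '<img src="/uploads/allimg/a.jpg" alt="sample one" /></a></li>'
    '<li><a href="/tupian/2.html">'
    '<img src="/uploads/allimg/b.jpg" alt="sample two" /></a></li>'
)

# Two capture groups, so findall returns a list of (src, alt) tuples
pattern = re.compile(r'img src="(.*?)" alt="(.*?)" /')
img_info = pattern.findall(sample_html)

print(img_info)
# [('/uploads/allimg/a.jpg', 'sample one'), ('/uploads/allimg/b.jpg', 'sample two')]
```

The lazy quantifier (.*?) stops at the first closing quote, which is what keeps each match confined to a single tag.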

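Extracting the image data with XPath

Inspecting the page reveals the XPath to each image's link and name, so XPath expressions can locate and extract the data we want. Each extracted src is a relative path, however, and must be prefixed with 'http://pic.netbian.com' to form the real image URL. A list comprehension does this in one line.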
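
As a side note, the standard library's urljoin is an alternative to plain string concatenation for building the absolute URLs. The sketch below uses made-up src values, not ones taken from the site, and shows that both approaches agree for root-relative paths.

```python
from urllib.parse import urljoin

BASE_URL = "http://pic.netbian.com"
# Hypothetical relative paths of the kind an @src attribute might hold
img_src = ["/uploads/allimg/a.jpg", "/uploads/allimg/b.jpg"]

# The article's approach: prepend the site root with a list comprehension
by_concat = [BASE_URL + x for x in img_src]
# Alternative: let urljoin resolve each path against the base URL
by_urljoin = [urljoin(BASE_URL, x) for x in img_src]

print(by_concat == by_urljoin)  # True
# urljoin also leaves an already-absolute URL unchanged
print(urljoin(BASE_URL, "http://example.com/x.jpg"))  # http://example.com/x.jpg
```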

Implementation:

import os

import requests
from lxml import etree

# Directory where the images are saved (created if it does not exist)
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
# Target URL
url = "http://pic.netbian.com/4kmeinv/index.html"
# Browser-like request headers, to reduce the chance of being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}

# Send the request and get the response
response = requests.get(url, headers=headers)
# Printing the page source shows garbled text; resetting the encoding to GBK
# makes the content display correctly so the data can be extracted
response.encoding = 'GBK'
html = etree.HTML(response.text)
# Use XPath to extract the image links and names
img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
# List comprehension that builds the real image URLs
img_src = ['http://pic.netbian.com' + x for x in img_src]
img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')

for src, name in zip(img_src, img_alt):
    img_content = requests.get(src, headers=headers).content
    img_name = name + '.jpg'
    with open(os.path.join(path, img_name), 'wb') as f:  # save the image locally
        print(f"Downloading image: {img_name}")
        f.write(img_content)
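
Because the alt text is used directly as the file name, a title containing a character that Windows forbids in file names would make open() fail. Whether such titles actually occur on this site has not been checked here. The sketch below shows one way to guard against it; safe_filename is a hypothetical helper, not part of the original code.

```python
import re


def safe_filename(name, ext=".jpg"):
    """Replace characters Windows forbids in file names (hypothetical helper)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    # Fall back to a placeholder if nothing usable is left
    return (cleaned or "untitled") + ext


print(safe_filename('sample: "4K" wallpaper?'))  # sample_ _4K_ wallpaper_.jpg
print(safe_filename("plain title"))              # plain title.jpg
```

In the loop above, img_name = safe_filename(name) would replace img_name = name + '.jpg'.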

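2. Paging through the site for batch downloads

Page through the site by hand to find the URL pattern:
Page 1: http://pic.netbian.com/4kmeinv/index.html
Page 2: http://pic.netbian.com/4kmeinv/index_2.html
Page 3: http://pic.netbian.com/4kmeinv/index_3.html
Last page: http://pic.netbian.com/4kmeinv/index_161.html
Apart from the first page, which is a special case, the URLs follow a regular pattern. A list comprehension can therefore generate the URL list, and looping over that list and requesting each link scrapes the images page by page.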
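
The pattern above can be sketched as a small function. The name page_urls and its last_page parameter are introduced here for illustration only; the page count of 161 comes from the last-page URL listed above and may change as the site grows.

```python
BASE = "http://pic.netbian.com/4kmeinv/"


def page_urls(last_page):
    """Return the listing-page URLs for pages 1..last_page."""
    # Page 1 has no numeric suffix; pages 2 and up use index_<n>.html
    first = [BASE + "index.html"]
    rest = [f"{BASE}index_{i}.html" for i in range(2, last_page + 1)]
    return first + rest


urls = page_urls(3)
for u in urls:
    print(u)
# http://pic.netbian.com/4kmeinv/index.html
# http://pic.netbian.com/4kmeinv/index_2.html
# http://pic.netbian.com/4kmeinv/index_3.html
```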

Single-threaded version

import datetime
import os
import time

import requests
from lxml import etree

# Directory where the images are saved (created if it does not exist)
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}
start = datetime.datetime.now()


def get_img(urls):
    for url in urls:
        # Send the request and get the response
        response = requests.get(url, headers=headers)
        # Printing the page source shows garbled text; resetting the encoding
        # to GBK makes the content display correctly for extraction
        response.encoding = 'GBK'
        html = etree.HTML(response.text)
        # Use XPath to extract the image links and names
        img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
        # List comprehension that builds the real image URLs
        img_src = ['http://pic.netbian.com' + x for x in img_src]
        img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
        for src, name in zip(img_src, img_alt):
            img_content = requests.get(src, headers=headers).content
            img_name = name + '.jpg'
            with open(os.path.join(path, img_name), 'wb') as f:  # save the image locally
                # print(f"Downloading image: {img_name}")
                f.write(img_content)
        time.sleep(1)  # pause after each page


def main():
    # List of URLs to request
    url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 11)]
    get_img(url_list)
    delta = (datetime.datetime.now() - start).total_seconds()
    print(f"Time taken to scrape 10 pages of images: {delta}s")


if __name__ == '__main__':
    main()

The program ran successfully: it scraped 10 pages, 210 images in total, in 63.682837 s.

Multithreaded version

import datetime
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import etree

# Directory where the images are saved (created if it does not exist)
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
# Pool of User-Agent strings; one is picked at random for each page
user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
start = datetime.datetime.now()


def get_img(url):
    headers = {
        "User-Agent": random.choice(user_agent),
        "Referer": "http://pic.netbian.com/4kmeinv/index.html"
    }
    # Send the request and get the response
    response = requests.get(url, headers=headers)
    # Printing the page source shows garbled text; resetting the encoding
    # to GBK makes the content display correctly for extraction
    response.encoding = 'GBK'
    html = etree.HTML(response.text)
    # Use XPath to extract the image links and names
    img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
    # List comprehension that builds the real image URLs
    img_src = ['http://pic.netbian.com' + x for x in img_src]
    img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
    for src, name in zip(img_src, img_alt):
        img_content = requests.get(src, headers=headers).content
        img_name = name + '.jpg'
        with open(os.path.join(path, img_name), 'wb') as f:  # save the image locally
            # print(f"Downloading image: {img_name}")
            f.write(img_content)
    time.sleep(random.randint(1, 2))  # random pause after each page


def main():
    # List of URLs to request
    url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 51)]
    with ThreadPoolExecutor(max_workers=6) as executor:
        # list() consumes the results so that an exception raised in a worker
        # is re-raised here instead of being silently dropped
        list(executor.map(get_img, url_list))
    delta = (datetime.datetime.now() - start).total_seconds()
    print(f"Time taken to scrape 50 pages of images: {delta}s")


if __name__ == '__main__':
    main()

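The program ran successfully: it scraped 50 pages, 1047 images in total, in 56.71979 s. Using multiple threads greatly improves the efficiency of scraping.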
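
The speed-up comes from overlapping the time each thread spends waiting on I/O. The sketch below illustrates this without touching the network: fake_download is a hypothetical stand-in for get_img that simply sleeps, and the page numbers and sleep duration are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Stand-in for get_img: sleeps instead of doing network I/O."""
    time.sleep(0.1)
    return page * 2


pages = list(range(1, 7))

# Sequential: the waits add up
t0 = time.perf_counter()
sequential = [fake_download(p) for p in pages]
sequential_time = time.perf_counter() - t0

# Threaded: the waits overlap across six worker threads
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as executor:
    # list() consumes the iterator: results arrive in input order, and an
    # exception raised in a worker would be re-raised at this point
    threaded = list(executor.map(fake_download, pages))
threaded_time = time.perf_counter() - t0

print(sequential == threaded)  # True
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

Threads help here because the work is I/O-bound. The real gain depends on the network and on how the server responds to concurrent requests.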

[Screenshot: the downloaded images]
