February 14th has come around again. These past few days you have surely seen programmers all over the blogs showing off their love in creative ways. Eraser (橡皮擦) is different: I am busy helping other people pick a place to shoot their wedding photos.

When the partner you `new`-ed up asks, "Where in Beijing can we get wedding photos that are cheap and good?", you pull up the data in a flash and instantly win an adoring look from your sweetheart.
Before we start

The goal is set, so now it is time to write the crawler. Young people, go enjoy showing off your love; the grunt work falls to those of us who have been through it. The target site this time is https://www.jiehun.com.cn/ — back in the day, Eraser's own wedding photos were booked at one of this organization's wedding expos. It holds large wedding fairs every spring, summer, autumn and winter, simultaneously in Beijing, Shanghai, Guangzhou, Tianjin, Wuhan, Hangzhou, Chengdu and other cities.

A first test function grabs a single listing page with `requests` and extracts each shop's name, address, price, rating and comment with `lxml` XPath:

```python
import json

import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/beijing/ch2065/store-p13/?ce=beijing"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    li_list = html.xpath("//div[@id='stlist']/ul/li")
    for li in li_list:
        # Rating and comment are only present for shops with reviews.
        star = li.xpath("./div[@class='comment']/p[1]/b/text()")
        comment = "--"
        if star:
            star = star[0]
            comment = li.xpath("./div[@class='comment']/p[2]/a/text()")[0]
        else:
            star = "--"

        name = li.xpath(".//a[@class='namelimit']/text()")[0]
        store = li.xpath(
            ".//div[@class='storename']/following-sibling::p[1]/text()")[0]
        price = li.xpath(
            ".//div[@class='storename']/following-sibling::p[2]/span[1]/text()")
        price = price[0] if price else "--"

        item = {
            "name": name,
            "store": store,
            "price": price,
            "star": star,
            "comment": comment
        }
        print(item)

        with open("hun.json", "ab+") as filename:
            filename.write(json.dumps(
                item, ensure_ascii=False).encode("utf-8") + b"\n")
```

The scraped data is stored as JSON, one record per line, so it can be read back one line at a time:

```json
{"name": "9Xi·婚紗攝影", "store": "商家地址:北京市朝陽區朝外大街丙10號9Xi結婚匯購物中心", "price": "¥4999", "star": "--", "comment": "--"}
{"name": "非目環球旅拍", "store": "商家地址:杭州市濱江區非目影像(總店)", "price": "¥19800", "star": "--", "comment": "--"}
{"name": "小白工作室(私人會所)", "store": "商家地址:朝陽北路天鵝灣北區7號樓二單元502(朝陽大悅城對面)", "price": "--", "star": "--", "comment": "--"}
{"name": "朵美婚拍", "store": "商家地址:北京市朝陽區廣渠門外大街8號優士閣A座大堂底商", "price": "¥2999", "star": "--", "comment": "--"}
{"name": "柏悅時尚藝術館", "store": "商家地址:立湯路186號龍德廣場四層F420A", "price": "--", "star": "--", "comment": "--"}
```

Once the test scrape works, every shop in Beijing can be crawled in batch (for other regions, just change the city in the URL). The data volume is not large, but Eraser has still thoughtfully prepared a multithreaded crawler for you (sigh, crawling is never learned that easily). The complete code follows, all in one go so you get the fullest data; feel free to rest your finger on the like button for Eraser. It starts with these imports:

```python
import threading
import requests as r
from queue import Queue
import time
from lxml import etree
import json
```
Two global flags tell the worker threads when to exit; `ThreadCrawl` downloads listing pages into a queue, and `ThreadParse` consumes that queue:

```python
CRAWL_EXIT = False
PARSE_EXIT = False


class ThreadCrawl(threading.Thread):
    """Producer: downloads listing pages and queues the raw HTML."""

    def __init__(self, thread_name, page_queue, data_queue):
        super(ThreadCrawl, self).__init__()
        self.thread_name = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}

    def run(self):
        print("starting", self.thread_name)
        while not CRAWL_EXIT:
            try:
                # Non-blocking get raises queue.Empty once all pages are taken.
                page = self.page_queue.get(False)
                url = f"https://www.jiehun.com.cn/beijing/ch2065/store-p{page}/?ce=beijing"
                content = r.get(url, headers=self.headers).text
                time.sleep(1)  # be gentle with the server
                self.data_queue.put(content)
            except Exception as e:
                print(e)
        print("exiting", self.thread_name)


class ThreadParse(threading.Thread):
    """Consumer: parses queued HTML and appends records to the shared file."""

    def __init__(self, thread_name, data_queue, filename, lock):
        super(ThreadParse, self).__init__()
        self.thread_name = thread_name
        self.data_queue = data_queue
        self.filename = filename
        self.lock = lock

    def run(self):
        print("starting", self.thread_name)
        while not PARSE_EXIT:
            try:
                html = self.data_queue.get(False)
                self.parse(html)
            except Exception as e:
                print(e)
        print("exiting", self.thread_name)

    def parse(self, html):
        html = etree.HTML(html)
        li_list = html.xpath("//div[@id='stlist']/ul/li")
        for li in li_list:
            star = li.xpath("./div[@class='comment']/p[1]/b/text()")
            comment = "--"
            if star:
                star = star[0]
                comment = li.xpath("./div[@class='comment']/p[2]/a/text()")[0]
            else:
                star = "--"
            name = li.xpath(".//a[@class='namelimit']/text()")[0]
            store = li.xpath(
                ".//div[@class='storename']/following-sibling::p[1]/text()")[0]
            price = li.xpath(
                ".//div[@class='storename']/following-sibling::p[2]/span[1]/text()")
            price = price[0] if price else "--"
            item = {
                "name": name,
                "store": store,
                "price": price,
                "star": star,
                "comment": comment
            }
            # The file handle is shared by all parser threads,
            # so writes must be serialized with the lock.
            with self.lock:
                self.filename.write(json.dumps(
                    item, ensure_ascii=False).encode("utf-8") + b"\n")
```
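As an aside, the two classes above implement a classic producer-consumer pipeline. The same shutdown can be achieved without global exit flags by pushing sentinel values through the queue, one per consumer. A minimal standalone sketch of that alternative (the names `producer`, `consumer` and `SENTINEL` are mine, not from the article's code):

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical end-of-work marker, one per consumer


def producer(q):
    for page in range(1, 4):
        q.put(f"page-{page}")  # stand-in for downloaded HTML


def consumer(q, results):
    while True:
        item = q.get()
        if item is SENTINEL:          # sentinel -> this consumer is done
            break
        results.append(item.upper())  # stand-in for parsing


q, results = Queue(), []
workers = [threading.Thread(target=consumer, args=(q, results)) for _ in range(2)]
for w in workers:
    w.start()
producer(q)
for _ in workers:
    q.put(SENTINEL)  # one sentinel per consumer thread
for w in workers:
    w.join()
print(sorted(results))  # ['PAGE-1', 'PAGE-2', 'PAGE-3']
```

With sentinels there is no busy-waiting and no shared mutable flag; the queue itself carries the shutdown signal.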
`main()` wires everything together:

```python
def main():
    # 14 listing pages to crawl.
    page_queue = Queue(14)
    for i in range(1, 15):
        page_queue.put(i)

    data_queue = Queue()
    filename = open("hun.json", "ab+")
    lock = threading.Lock()

    # Three downloader threads.
    crawl_list = ["crawl-thread-1", "crawl-thread-2", "crawl-thread-3"]
    threadcrawl = []
    for thread_name in crawl_list:
        thread = ThreadCrawl(thread_name, page_queue, data_queue)
        thread.start()
        threadcrawl.append(thread)

    # Three parser threads.
    parse_list = ["parse-thread-1", "parse-thread-2", "parse-thread-3"]
    threadparse = []
    for thread_name in parse_list:
        thread = ThreadParse(thread_name, data_queue, filename, lock)
        thread.start()
        threadparse.append(thread)

    # Busy-wait until every page number has been taken,
    # then signal the downloader threads to exit.
    while not page_queue.empty():
        pass
    global CRAWL_EXIT
    CRAWL_EXIT = True
    print("page_queue is empty")
    for thread in threadcrawl:
        thread.join()
    print("crawl threads finished")

    # Same pattern for the parser threads.
    while not data_queue.empty():
        pass
    global PARSE_EXIT
    PARSE_EXIT = True
    for thread in threadparse:
        thread.join()
    print("parse threads finished")

    with lock:
        filename.close()


if __name__ == "__main__":
    main()  # entry point; the original listing defines main() without calling it
```

To keep you from finding the cheapest shop too easily, I deliberately stored the data as unsorted JSON lines; the sorting is left to you. After all, you can't have a girlfriend and contribute nothing at all.
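The sorting that is left to you can be sketched in a few lines: read the file back one JSON object per line, parse the "¥4999"-style price strings, and push the "--" entries (no listed price) to the end. A minimal sketch using the records shown earlier as stand-in data (the file name `hun_sample.json` and the helper `price_key` are mine, to avoid touching a real crawl's `hun.json`):

```python
import json

# Stand-in for a crawler run: the records shown earlier,
# one JSON object per line, in the same layout as hun.json.
sample = [
    {"name": "9Xi·婚紗攝影", "price": "¥4999"},
    {"name": "非目環球旅拍", "price": "¥19800"},
    {"name": "小白工作室(私人會所)", "price": "--"},
    {"name": "朵美婚拍", "price": "¥2999"},
]
with open("hun_sample.json", "wb") as f:
    for item in sample:
        f.write(json.dumps(item, ensure_ascii=False).encode("utf-8") + b"\n")


def price_key(item):
    """'¥4999' -> 4999; shops without a listed price ('--') sort last."""
    p = item["price"]
    return float("inf") if p == "--" else int(p.lstrip("¥"))


# One json.loads() per line, then sort cheapest-first.
shops = [json.loads(line) for line in open("hun_sample.json", encoding="utf-8")]
for shop in sorted(shops, key=price_key):
    print(shop["name"], shop["price"])
```

朵美婚拍 at ¥2999 comes out first; the shops with no listed price trail at the end.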
Choosing by photos

You thought the job was done? Of course not. Besides price, the partner we `new`-ed up also wants to pick by the photos themselves, if only to see whose shooting style matches her taste. Two steps:

1. Batch-collect the URLs of the photo detail pages.
2. For each detail page, download the images.
Collecting the detail-page links:

```python
import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/beijing/ch2065/album-p461/?attr_110=&cate_id=2065&ce=beijing"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    a_list = html.xpath("//div[@class='rectangle_list']/ul/li/a/@href")
    for a in a_list:
        with open("album_link.json", "a+") as filename:
            filename.write(f"https://www.jiehun.com.cn/{a}\n")
```

After running for a short while, the collected links look like this:
```
https://www.jiehun.com.cn//album/730704/
https://www.jiehun.com.cn//album/730703/
https://www.jiehun.com.cn//album/730702/
https://www.jiehun.com.cn//album/730701/
https://www.jiehun.com.cn//album/730700/
https://www.jiehun.com.cn//album/730699/
https://www.jiehun.com.cn//album/730698/
https://www.jiehun.com.cn//album/730697/
https://www.jiehun.com.cn//album/730696/
https://www.jiehun.com.cn//album/730695/
https://www.jiehun.com.cn//album/730694/
https://www.jiehun.com.cn//album/730693/
https://www.jiehun.com.cn//album/730679/
```

Image scraping then works through the links stored in album_link.json and parses the `img` tags on each page. First, read album_link.json and build the queue of pages to fetch:

```python
def read_file():
    page_queue = Queue()
    for f in open("album_link.json", "r"):
        print(f.strip())
        page_queue.put(f)
    print(page_queue.qsize())
```

Fetching a single detail page and saving its images:

```python
def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/album/730698/"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    title = html.xpath("//div[@class='detailintro_l']/h2/text()")[0]
    img_list = html.xpath("//div[@class='img']/img/@src")
    for index, img_url in enumerate(img_list):
        content = r.get(img_url, headers=headers).content
        # The imgs/ directory must exist beforehand.
        with open(f"./imgs/{title}-{index}.jpg", "wb+") as filename:
            filename.write(content)
```

With the full code in place, wait for the photos to land on disk, then browse them with your partner one by one; when you spot a good shot, look up which studio took it and go book there.
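Two small robustness notes on the code above: the links written to album_link.json carry a doubled slash (`.../​/album/...`) because the scraped hrefs already start with `/`, and a shop title may contain characters that are illegal in file names. A hedged sketch of both fixes using only the standard library (the helper name `safe_filename` is mine, not from the original code):

```python
import re
from urllib.parse import urljoin

# urljoin collapses the duplicate slash that plain string concatenation
# ("https://www.jiehun.com.cn/" + "/album/730704/") produces.
link = urljoin("https://www.jiehun.com.cn/", "/album/730704/")
print(link)  # https://www.jiehun.com.cn/album/730704/


def safe_filename(title: str) -> str:
    """Replace characters that Windows/most filesystems reject in names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


print(safe_filename('唯美/韓式 "小清新"'))  # 唯美_韓式 _小清新_
```

Using `safe_filename(title)` when building the `./imgs/...` path avoids crashes on titles containing `/` or `"`.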
After running for just a little while I already had 600+ photos. My hard drive is on the small side, so the rest of the crawling is up to you — truth be told, I was simply stuffed full by all those photos of couples flaunting their love, ε=(´ο`*))) sigh, every one of those kissing shots landed a critical hit on Eraser.
☞ "Object orientation was a mistake!" (in Chinese, "object" 對象 also means "partner")