February 14th has come around again. These past few days you have surely seen programmers all over the blogs showing off their love in creative ways. Eraser (橡皮擦) is different: I am busy helping other people pick a place to shoot their wedding photos.

When the partner you `new`-ed up asks, "Where in Beijing can we get wedding photos that are cheap and good?", you pull up the data in a flash and instantly win an adoring look from your sweetheart.
Before we start

The goal is set, so now it is time to write the crawler. Young people, go enjoy showing off your love; the grunt work falls to those of us who have been through it. The target site this time is https://www.jiehun.com.cn/ — back in the day, Eraser's own wedding photos were booked at one of this organization's wedding expos. It holds large wedding fairs every spring, summer, autumn and winter, simultaneously in Beijing, Shanghai, Guangzhou, Tianjin, Wuhan, Hangzhou, Chengdu and other cities.

A first test function grabs a single listing page with `requests` and extracts each shop's name, address, price, rating and comment with `lxml` XPath:

```python
import json

import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/beijing/ch2065/store-p13/?ce=beijing"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    li_list = html.xpath("//div[@id='stlist']/ul/li")
    for li in li_list:
        # Rating and comment are only present for shops with reviews.
        star = li.xpath("./div[@class='comment']/p[1]/b/text()")
        comment = "--"
        if star:
            star = star[0]
            comment = li.xpath("./div[@class='comment']/p[2]/a/text()")[0]
        else:
            star = "--"

        name = li.xpath(".//a[@class='namelimit']/text()")[0]
        store = li.xpath(
            ".//div[@class='storename']/following-sibling::p[1]/text()")[0]
        price = li.xpath(
            ".//div[@class='storename']/following-sibling::p[2]/span[1]/text()")
        price = price[0] if price else "--"

        item = {
            "name": name,
            "store": store,
            "price": price,
            "star": star,
            "comment": comment
        }
        print(item)

        with open("hun.json", "ab+") as filename:
            filename.write(json.dumps(
                item, ensure_ascii=False).encode("utf-8") + b"\n")
```

The scraped data is stored as JSON, one record per line, so it can be read back one line at a time:

```json
{"name": "9Xi·婚紗攝影", "store": "商家地址:北京市朝陽區朝外大街丙10號9Xi結婚匯購物中心", "price": "¥4999", "star": "--", "comment": "--"}
{"name": "非目環球旅拍", "store": "商家地址:杭州市濱江區非目影像(總店)", "price": "¥19800", "star": "--", "comment": "--"}
{"name": "小白工作室(私人會所)", "store": "商家地址:朝陽北路天鵝灣北區7號樓二單元502(朝陽大悅城對面)", "price": "--", "star": "--", "comment": "--"}
{"name": "朵美婚拍", "store": "商家地址:北京市朝陽區廣渠門外大街8號優士閣A座大堂底商", "price": "¥2999", "star": "--", "comment": "--"}
{"name": "柏悅時尚藝術館", "store": "商家地址:立湯路186號龍德廣場四層F420A", "price": "--", "star": "--", "comment": "--"}
```

Once the test scrape works, every shop in Beijing can be crawled in batch (for other regions, just change the city in the URL). The data volume is not large, but Eraser has still thoughtfully prepared a multithreaded crawler for you (sigh, crawling is never learned that easily). The complete code follows, all in one go so you get the fullest data; feel free to rest your finger on the like button for Eraser. It starts with these imports:

```python
import threading
import requests as r
from queue import Queue
import time
from lxml import etree
import json
```
Two global flags tell the worker threads when to exit; `ThreadCrawl` downloads listing pages into a queue, and `ThreadParse` consumes that queue:

```python
CRAWL_EXIT = False
PARSE_EXIT = False


class ThreadCrawl(threading.Thread):
    """Producer: downloads listing pages and queues the raw HTML."""

    def __init__(self, thread_name, page_queue, data_queue):
        super(ThreadCrawl, self).__init__()
        self.thread_name = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}

    def run(self):
        print("starting", self.thread_name)
        while not CRAWL_EXIT:
            try:
                # Non-blocking get raises queue.Empty once all pages are taken.
                page = self.page_queue.get(False)
                url = f"https://www.jiehun.com.cn/beijing/ch2065/store-p{page}/?ce=beijing"
                content = r.get(url, headers=self.headers).text
                time.sleep(1)  # be gentle with the server
                self.data_queue.put(content)
            except Exception as e:
                print(e)
        print("exiting", self.thread_name)


class ThreadParse(threading.Thread):
    """Consumer: parses queued HTML and appends records to the shared file."""

    def __init__(self, thread_name, data_queue, filename, lock):
        super(ThreadParse, self).__init__()
        self.thread_name = thread_name
        self.data_queue = data_queue
        self.filename = filename
        self.lock = lock

    def run(self):
        print("starting", self.thread_name)
        while not PARSE_EXIT:
            try:
                html = self.data_queue.get(False)
                self.parse(html)
            except Exception as e:
                print(e)
        print("exiting", self.thread_name)

    def parse(self, html):
        html = etree.HTML(html)
        li_list = html.xpath("//div[@id='stlist']/ul/li")
        for li in li_list:
            star = li.xpath("./div[@class='comment']/p[1]/b/text()")
            comment = "--"
            if star:
                star = star[0]
                comment = li.xpath("./div[@class='comment']/p[2]/a/text()")[0]
            else:
                star = "--"
            name = li.xpath(".//a[@class='namelimit']/text()")[0]
            store = li.xpath(
                ".//div[@class='storename']/following-sibling::p[1]/text()")[0]
            price = li.xpath(
                ".//div[@class='storename']/following-sibling::p[2]/span[1]/text()")
            price = price[0] if price else "--"
            item = {
                "name": name,
                "store": store,
                "price": price,
                "star": star,
                "comment": comment
            }
            # The file handle is shared by all parser threads,
            # so writes must be serialized with the lock.
            with self.lock:
                self.filename.write(json.dumps(
                    item, ensure_ascii=False).encode("utf-8") + b"\n")
```
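As an aside, the two classes above implement a classic producer-consumer pipeline. The same shutdown can be achieved without global exit flags by pushing sentinel values through the queue, one per consumer. A minimal standalone sketch of that alternative (the names `producer`, `consumer` and `SENTINEL` are mine, not from the article's code):

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical end-of-work marker, one per consumer


def producer(q):
    for page in range(1, 4):
        q.put(f"page-{page}")  # stand-in for downloaded HTML


def consumer(q, results):
    while True:
        item = q.get()
        if item is SENTINEL:          # sentinel -> this consumer is done
            break
        results.append(item.upper())  # stand-in for parsing


q, results = Queue(), []
workers = [threading.Thread(target=consumer, args=(q, results)) for _ in range(2)]
for w in workers:
    w.start()
producer(q)
for _ in workers:
    q.put(SENTINEL)  # one sentinel per consumer thread
for w in workers:
    w.join()
print(sorted(results))  # ['PAGE-1', 'PAGE-2', 'PAGE-3']
```

With sentinels there is no busy-waiting and no shared mutable flag; the queue itself carries the shutdown signal.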
`main()` wires everything together:

```python
def main():
    # 14 listing pages to crawl.
    page_queue = Queue(14)
    for i in range(1, 15):
        page_queue.put(i)

    data_queue = Queue()
    filename = open("hun.json", "ab+")
    lock = threading.Lock()

    # Three downloader threads.
    crawl_list = ["crawl-thread-1", "crawl-thread-2", "crawl-thread-3"]
    threadcrawl = []
    for thread_name in crawl_list:
        thread = ThreadCrawl(thread_name, page_queue, data_queue)
        thread.start()
        threadcrawl.append(thread)

    # Three parser threads.
    parse_list = ["parse-thread-1", "parse-thread-2", "parse-thread-3"]
    threadparse = []
    for thread_name in parse_list:
        thread = ThreadParse(thread_name, data_queue, filename, lock)
        thread.start()
        threadparse.append(thread)

    # Busy-wait until every page number has been taken,
    # then signal the downloader threads to exit.
    while not page_queue.empty():
        pass
    global CRAWL_EXIT
    CRAWL_EXIT = True
    print("page_queue is empty")
    for thread in threadcrawl:
        thread.join()
    print("crawl threads finished")

    # Same pattern for the parser threads.
    while not data_queue.empty():
        pass
    global PARSE_EXIT
    PARSE_EXIT = True
    for thread in threadparse:
        thread.join()
    print("parse threads finished")

    with lock:
        filename.close()


if __name__ == "__main__":
    main()  # entry point; the original listing defines main() without calling it
```

To keep you from finding the cheapest shop too easily, I deliberately stored the data as unsorted JSON lines; the sorting is left to you. After all, you can't have a girlfriend and contribute nothing at all.
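The sorting that is left to you can be sketched in a few lines: read the file back one JSON object per line, parse the "¥4999"-style price strings, and push the "--" entries (no listed price) to the end. A minimal sketch using the records shown earlier as stand-in data (the file name `hun_sample.json` and the helper `price_key` are mine, to avoid touching a real crawl's `hun.json`):

```python
import json

# Stand-in for a crawler run: the records shown earlier,
# one JSON object per line, in the same layout as hun.json.
sample = [
    {"name": "9Xi·婚紗攝影", "price": "¥4999"},
    {"name": "非目環球旅拍", "price": "¥19800"},
    {"name": "小白工作室(私人會所)", "price": "--"},
    {"name": "朵美婚拍", "price": "¥2999"},
]
with open("hun_sample.json", "wb") as f:
    for item in sample:
        f.write(json.dumps(item, ensure_ascii=False).encode("utf-8") + b"\n")


def price_key(item):
    """'¥4999' -> 4999; shops without a listed price ('--') sort last."""
    p = item["price"]
    return float("inf") if p == "--" else int(p.lstrip("¥"))


# One json.loads() per line, then sort cheapest-first.
shops = [json.loads(line) for line in open("hun_sample.json", encoding="utf-8")]
for shop in sorted(shops, key=price_key):
    print(shop["name"], shop["price"])
```

朵美婚拍 at ¥2999 comes out first; the shops with no listed price trail at the end.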
Choosing by photos

You thought the job was done? Of course not. Besides price, the partner we `new`-ed up also wants to pick by the photos themselves, if only to see whose shooting style matches her taste. Two steps:

1. Batch-collect the URLs of the photo detail pages.
2. For each detail page, download the images.
Collecting the detail-page links:

```python
import requests as r
from lxml import etree


def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/beijing/ch2065/album-p461/?attr_110=&cate_id=2065&ce=beijing"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    a_list = html.xpath("//div[@class='rectangle_list']/ul/li/a/@href")
    for a in a_list:
        with open("album_link.json", "a+") as filename:
            filename.write(f"https://www.jiehun.com.cn/{a}\n")
```

After running for a short while, the collected links look like this:
```
https://www.jiehun.com.cn//album/730704/
https://www.jiehun.com.cn//album/730703/
https://www.jiehun.com.cn//album/730702/
https://www.jiehun.com.cn//album/730701/
https://www.jiehun.com.cn//album/730700/
https://www.jiehun.com.cn//album/730699/
https://www.jiehun.com.cn//album/730698/
https://www.jiehun.com.cn//album/730697/
https://www.jiehun.com.cn//album/730696/
https://www.jiehun.com.cn//album/730695/
https://www.jiehun.com.cn//album/730694/
https://www.jiehun.com.cn//album/730693/
https://www.jiehun.com.cn//album/730679/
```

Image scraping then works through the links stored in album_link.json and parses the `img` tags on each page. First, read album_link.json and build the queue of pages to fetch:

```python
def read_file():
    page_queue = Queue()
    for f in open("album_link.json", "r"):
        print(f.strip())
        page_queue.put(f)
    print(page_queue.qsize())
```

Fetching a single detail page and saving its images:

```python
def demo():
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"}
    url = "https://www.jiehun.com.cn/album/730698/"
    content = r.get(url, headers=headers).text
    html = etree.HTML(content)
    title = html.xpath("//div[@class='detailintro_l']/h2/text()")[0]
    img_list = html.xpath("//div[@class='img']/img/@src")
    for index, img_url in enumerate(img_list):
        content = r.get(img_url, headers=headers).content
        # The imgs/ directory must exist beforehand.
        with open(f"./imgs/{title}-{index}.jpg", "wb+") as filename:
            filename.write(content)
```

With the full code in place, wait for the photos to land on disk, then browse them with your partner one by one; when you spot a good shot, look up which studio took it and go book there.
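Two small robustness notes on the code above: the links written to album_link.json carry a doubled slash (`.../​/album/...`) because the scraped hrefs already start with `/`, and a shop title may contain characters that are illegal in file names. A hedged sketch of both fixes using only the standard library (the helper name `safe_filename` is mine, not from the original code):

```python
import re
from urllib.parse import urljoin

# urljoin collapses the duplicate slash that plain string concatenation
# ("https://www.jiehun.com.cn/" + "/album/730704/") produces.
link = urljoin("https://www.jiehun.com.cn/", "/album/730704/")
print(link)  # https://www.jiehun.com.cn/album/730704/


def safe_filename(title: str) -> str:
    """Replace characters that Windows/most filesystems reject in names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


print(safe_filename('唯美/韓式 "小清新"'))  # 唯美_韓式 _小清新_
```

Using `safe_filename(title)` when building the `./imgs/...` path avoids crashes on titles containing `/` or `"`.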
After running for just a little while I already had 600+ photos. My hard drive is on the small side, so the rest of the crawling is up to you — truth be told, I was simply stuffed full by all those photos of couples flaunting their love, ε=(´ο`*))) sigh, every one of those kissing shots landed a critical hit on Eraser.
☞ "Object orientation was a mistake!" (in Chinese, "object" 對象 also means "partner")