PyQuery: The Most Concise and Elegant Library in the Web-Scraping World

2021-02-24 戀習Python

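The concise PyQuery library

pyquery is a jQuery-like library for Python: it lets you query and manipulate parsed HTML documents using jQuery syntax. It is easy to use and parses quickly, which makes it well suited to fetching and parsing web page data.

Official PyQuery documentation: https://pythonhosted.org/pyquery/index.html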

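Sections in this article:

Initializing a PyQuery object

Common CSS selectors

Pseudo-class selectors

Finding tags

Getting tag information

Advanced usage

1. Initializing a PyQuery object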

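html_string = """
<html>
<head>
簡單好用的
<title>PyQuery</title>
</head>
<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>
</html>
"""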

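This is comparable to initialization in the BeautifulSoup library, where the HTML is converted into a BeautifulSoup object:

bsObj = BeautifulSoup(html, 'html.parser')

PyQuery likewise has its own initialization step.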

1.1 Initializing from a string

from pyquery import PyQuery

# initialize a PyQuery object from the string
doc = PyQuery(html_string)
print(type(doc))
print(doc)

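Run

<class 'pyquery.pyquery.PyQuery'>
<html>
<head>
簡單好用的
<title>PyQuery</title>
</head>
<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>
</html>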

1.2 Initializing from an HTML file

# the filename parameter is the path to the HTML file
doc = PyQuery(filename='test.html')
print(type(doc))
print(doc)

1.3 Fetching a website and initializing from it

response = PyQuery(url='https://www.baidu.com')
print(type(response))
print(response)

Run

<html> <head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css"/><title>ç¾åº¦ä¸ä¸ï¼ä½ å°±ç¥é</title></head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129"/> </div> <form id="form" name="f" action="//www.baidu.com/s" class="fm"> <input type="hidden" name="bdorz_come" value="1"/> <input type="hidden" name="ie" value="utf-8"/> <input type="hidden" name="f" value="8"/> <input type="hidden" name="rsv_bp" value="1"/> <input type="hidden" name="rsv_idx" value="1"/> <input type="hidden" name="tn" value="baidu"/><span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off" autofocus="autofocus"/></span><span class="bg s_btn_wr"><input type="submit" id="su" value="ç¾åº¦ä¸ä¸" class="bg s_btn" autofocus=""/></span> </form> </div> </div> <div id="u1"> <a href="http://news.baidu.com" name="tj_trnews" class="mnav">æ°é»</a> <a href="https://www.hao123.com" name="tj_trhao123" class="mnav">hao123</a> <a href="http://map.baidu.com" name="tj_trmap" class="mnav">å°å¾</a> <a href="http://v.baidu.com" name="tj_trvideo" class="mnav">è§é¢</a> <a href="http://tieba.baidu.com" name="tj_trtieba" class="mnav">è´´å§</a> <noscript> <a 。。。。<div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">å³äºç¾åº¦</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">ä½¿ç¨ç¾åº¦åå¿è¯»</a> <a href="http://jianyi.baidu.com/" class="cp-feedback">æè§åé¦</a> äº¬ICPè¯030173å· <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div> </div> </body> </html>

The text above is garbled (mojibake), so we need to set the encoding parameter:

response = PyQuery(url = 'https://www.baidu.com', encoding='utf-8')

print(type(response))

print(response)

</script> <a href="//www.baidu.com/more/" name="tj_briicon" class="bri" style="display: block;">更多產品</a> </div> </div> </div> <div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">關於百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必讀</a> <a href="http://jianyi.baidu.com/" class="cp-feedback">意見反饋</a> 京ICP證030173號 <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div> </div> </body> </html>
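The garbled title above looks like UTF-8 bytes that were decoded with a single-byte codec. As a sketch of that mechanism, assuming Latin-1 was the wrong codec (the article does not say which one was used), plain Python reproduces the effect without any network access:

```python
original = "百度一下"

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1):
# every UTF-8 byte turns into its own unrelated character.
garbled = original.encode("utf-8").decode("latin-1")

# Undoing the wrong step recovers the text. Decoding the response as
# UTF-8 in the first place avoids the problem entirely.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

Each Chinese character occupies three bytes in UTF-8, so the four-character string becomes twelve characters once it is decoded byte by byte.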

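2. Common CSS selectors

From this section on, we operate on PyQuery objects to extract the data we want. You will see that, whichever library you use to parse web pages, the underlying principles are essentially the same: once you have learned one parsing library, the documentation of the others is easy to follow.

In CSS, a selector is a pattern used to select the elements you want to style.

Common CSS selectors

This article covers only the three most common CSS selectors: class, id, and element. If you are not familiar with CSS, that is fine; have a look at the CSS selector documentation on the W3C website.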

# use the doc from above
print(doc)

Run

<html>
<head>
簡單好用的
<title>PyQuery</title>
</head>
<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>
</html>

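2.1 Print the tag whose id is container

print(doc('#container'))
print(type(doc('#container')))

Run

<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
<class 'pyquery.pyquery.PyQuery'>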

2.2 Print the tag whose class is object-1

print(doc('.object-1'))

Run

<li class="object-1">Python</li>

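Print the tag whose tag name is body

print(doc('body'))

Run

<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>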

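2.3 Combining CSS selectors

There are many ways to select the ul node. For example, select the ul node whose attribute key-value pair is id=container:

print(doc("ul[id=container]"))

Run

<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>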

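3. Pseudo-class selectors

pseudo_html = """
<html>
<head>
簡單好用的
<title>PyQuery</title>
</head>
<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>
</html>
"""

pseudo_doc = PyQuery(pseudo_html)
print(pseudo_doc)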

Run

<html>
<head>
簡單好用的
<title>PyQuery</title>
</head>
<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>
</html>

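3.1 The nth-child pseudo-class

# print the second li tag
print(pseudo_doc('li:nth-child(2)'))
# print the first li tag
print(pseudo_doc('li:first-child'))
# print the last li tag
print(pseudo_doc('li:last-child'))

Run

<li class="object-2">大法</li>
<li class="object-1">Python</li>
<li class="object-3">好</li>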

3.2 contains

# find the li tags whose text contains Python
print(pseudo_doc("li:contains('Python')"))
# find the li tags whose text contains 好
print(pseudo_doc("li:contains('好')"))

Run

<li class="object-1">Python</li>
<li class="object-3">好</li>

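4. Finding tags

PyQuery objects offer many practical methods for locating nodes.

print(doc)

Run

<html>
<head>
簡單好用的
<title>PyQuery</title>
</head>
<body>
<ul id="container">
    <li class="object-1">Python</li>
    <li class="object-2">大法</li>
    <li class="object-3">好</li>
</ul>
</body>
</html>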

Get all the li tags

doc('li')

Run

[<li.object-1>, <li.object-2>, <li.object-3>]

Check the type of each li tag

[type(l) for l in doc('li')]

Run

[lxml.etree._Element, lxml.etree._Element, lxml.etree._Element]

Iterating over doc('li') yields lxml.etree._Element objects, not PyQuery objects. Use Element.text to get a node's text content:

[l.text for l in doc('li')]

Run

['Python', '大法', '好']

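Get all the li tags and iterate over them one at a time as PyQuery objects. doc.items('li') returns a generator, so nothing is produced until you iterate over it:

doc.items('li')

Check the data type of the objects that doc.items yields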

[type(l) for l in doc.items('li')]

Run

[pyquery.pyquery.PyQuery, pyquery.pyquery.PyQuery, pyquery.pyquery.PyQuery]

PyQuery objects have a .text() method

[l.text() for l in doc.items('li')]

Run

['Python', '大法', '好']

doc('body')

Run

[<body>]

Find the node with id='container' under the doc node (currently the whole document)

doc.find('#container')

Run

[<ul#container>]

The child nodes of the node with id='container'

doc.find('#container').children()

Run

[<li.object-1>, <li.object-2>, <li.object-3>]

The child nodes of the node with id='container', including text nodes

doc.find('#container').contents()

Run

['\n ', <Element li at 0x10e57ac08>, '\n ', <Element li at 0x10d687848>, '\n ', <Element li at 0x10d687e08>, '\n ']

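The second node among the contents of the node with id='container'. Because contents() also counts text nodes, index 1 is the first li (index 0 is the leading whitespace); use children().eq(1) if you want the second li.

doc.find('#container').contents().eq(1)

Run

[<li.object-1>]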

The child nodes of the node with id='container' that do not have class='object-2'

doc.find('#container').children().not_('.object-2')

Run

[<li.object-1>, <li.object-3>]

The .map method calls your function with two arguments: the index and the element.

def func(i, e):
    return PyQuery(e).text()

doc('li').map(func)

Run

['Python', '大法', '好']

doc('li').map(lambda i,e: PyQuery(e).text())

Run

['Python', '大法', '好']

Output the HTML of the ul node's children

doc('ul').html()

Run

'\n <li>Python</li>\n <li>大法</li>\n <li>好</li>\n '

5. Advanced pyquery usage

Comparing PyQuery with BeautifulSoup, we find that PyQuery can itself send a request to a URL. For example:

from pyquery import PyQuery

PyQuery(url = 'https://www.baidu.com')

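5.1 The opener parameter

Here PyQuery sends a request to the Baidu URL and turns the response into a PyQuery object. By default pyquery fetches pages with requests when it is installed and falls back to urllib otherwise; to use selenium or your own request logic instead, supply a custom opener parameter.

The opener parameter tells pyquery how to fetch the URL: it is a callable that receives the url and returns the page HTML. Common request libraries include urllib, requests, and selenium. Here we define a selenium-based opener.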

from pyquery import PyQuery
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

# fetch the url with selenium and return the rendered page source
def selenium_opener(url):
    # I have not put PhantomJS on my PATH, so I would have to pass its path every time
    # driver = PhantomJS(executable_path='path to phantomjs')
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    # executable_path is the Selenium 3 style; recent Selenium 4 releases
    # drop it in favor of Chrome(service=Service('path to chromedriver'), ...)
    driver = Chrome(executable_path='path to chromedriver',
                    options=chrome_options)
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html

# Note: pass the function itself as the opener, without parentheses!
PyQuery(url='https://www.baidu.com/',
        opener=selenium_opener)

Run

[<html>]

With that, we can apply everything covered above to the resulting PyQuery object and extract the information we need.

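5.2 cookies and headers

With requests, we usually make a visit look more like a real browser by sending headers, and when necessary cookies as well. pyquery accepts the same parameters, so it too can masquerade as a browser.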

from pyquery import PyQuery

cookies = {'Cookie': 'your cookie'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

PyQuery(url='https://www.baidu.com/',
        headers=headers,
        cookies=cookies)

Run

[<html>]

Note: the returned content may look very short, but that is only how a PyQuery object is displayed. Call .html() on it to get the page markup.

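5.3 Giving your selenium driver pyquery features

How can the page that the driver loads be turned directly into a PyQuery object, so that extracting data is more convenient?

Originally I wanted to build a custom class on PhantomJS to give selenium the features of pyquery. In testing I then found that PhantomJS and selenium stopped being supported together last year: the headless browser PhantomJS used to be irreplaceable, but today both Chrome and Firefox offer a headless mode.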

from pyquery import PyQuery
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

class Browser(Chrome):
    # Browser inherits from Chrome and starts Chrome in headless mode
    def __init__(self, executable_path):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        Chrome.__init__(self,
                        executable_path=executable_path,
                        options=chrome_options)

    # property is a decorator: the method right below @property
    # is exposed as an attribute of the class
    @property
    def dom(self):
        return PyQuery(self.page_source)

browser = Browser(executable_path='path to chromedriver')
browser.get(url='https://www.baidu.com/')
print(type(browser.dom))

Run

<class 'pyquery.pyquery.PyQuery'>

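I find these ways of extending pyquery neat and concise. I plan to imitate them more often, for example by wrapping existing libraries and functions in a function or a class to add capabilities.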
