Last time, in the article on regular expressions, we walked through a scraping exercise: extracting all the books, links, authors, and publication dates from the Douban front page. In that exercise we parsed the page source with regular expressions, which got fairly complex. That motivates today's topic, BeautifulSoup: a flexible and convenient web-page parsing library that is efficient and supports multiple parsers. With BeautifulSoup you can extract information from a page without writing a single regular expression.
I. Installing BeautifulSoup
pip install beautifulsoup4
II. Usage
1. Parsers
BeautifulSoup supports several parsers; you pick one when constructing the soup:

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | built into Python; moderate speed; tolerant of bad markup | versions before Python 2.7.3 / 3.2.2 handle malformed documents poorly
lxml HTML parser | BeautifulSoup(markup, "lxml") | very fast; tolerant of bad markup; the usual choice | requires the C library lxml
lxml XML parser | BeautifulSoup(markup, "xml") | very fast; the only XML parser | requires the C library lxml
html5lib | BeautifulSoup(markup, "html5lib") | best tolerance; parses documents the same way a browser does; produces valid HTML5 | very slow; an external Python dependency
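To see the tolerance differences in practice, here is a small sketch that feeds the same broken fragment to each parser. It is only an illustration: lxml and html5lib are optional installs, so the sketch skips them when they are absent.

```python
from bs4 import BeautifulSoup

# A fragment with unclosed tags; every parser must repair it somehow.
broken = "<p>Hello<b>world"

# html.parser ships with Python -- nothing extra to install.
fixed = str(BeautifulSoup(broken, "html.parser"))
print(fixed)  # the unclosed <b> and <p> get closing tags

# lxml and html5lib are optional third-party parsers;
# skip them gracefully when they are not installed.
for name in ("lxml", "html5lib"):
    try:
        print(name, BeautifulSoup(broken, name))
    except Exception as exc:
        print(name, "not available:", exc)
```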
2. Basic usage
Below is an incomplete HTML snippet: neither the body tag nor the html tag is closed.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Now parse the html above with the lxml parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
Below is the result after error-tolerant parsing has completed the missing tags, followed by the extracted title. Note that both the html and body tags have been closed:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
3. Tag selectors
(1) Selecting elements
Using the same html as above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
The output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Notice that only one p tag was printed, even though the HTML contains three.
This is a key property of tag selectors: when several tags share a name, only the first match is returned.
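The first-match behaviour is easy to verify with a trimmed-down stand-in for the page above (html.parser is used here only so the sketch needs no extra installs):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time ...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.p is shorthand for "the first <p> in the document"
first = soup.p
print(first["class"])     # the first <p> has class "title"

# find_all() (covered below) is the way to get every match
all_p = soup.find_all("p")
print(len(all_p))
```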
(2) Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Output:
dromouse
dromouse
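A few details worth knowing about attribute access, shown on a minimal made-up tag: .attrs returns the whole attribute dict, multi-valued attributes such as class come back as lists, and .get() is the safe lookup for attributes that may be missing.

```python
from bs4 import BeautifulSoup

html = '<p class="title story" name="dromouse"><b>text</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .attrs exposes every attribute as a dict
print(p.attrs)

# class is multi-valued in HTML, so it comes back as a list
print(p["class"])

# p['missing'] raises KeyError; .get() returns None instead
print(p.get("id"))
```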
(3) Getting text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
Output:
The Dormouse's story
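One caveat about .string, shown on a minimal example: it only returns a value when the tag has exactly one child string. When a tag mixes text with other tags, .string is None and get_text() is the tool to reach for.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time <b>three</b> sisters</p>"
soup = BeautifulSoup(html, "html.parser")

# Mixed children: .string gives None
print(soup.p.string)

# A single child string: .string works
print(soup.b.string)

# get_text() concatenates all descendant strings
print(soup.p.get_text())
```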
(4) Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
Output:
The Dormouse's story
(5) Getting children and descendants
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
.contents returns the tag's direct children as a list. (The output below comes from a variant of the snippet in which the story paragraph is the first <p> and link1 wraps <span>Elsie</span>.)
['\n Once upon a time there were three little sisters; and their names were\n ',
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>,
'\n'
, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
, '\n and\n '
, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
, '\n and they lived at the bottom of a well.\n ']
Another way to get the children is the .children iterator:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
print(i, child)
Output:
<list_iterator object at 0x1064f7dd8>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
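.children stops at direct children; the .descendants generator walks the whole subtree, yielding nested tags and text nodes too. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1"><span>Elsie</span></a></p>'
soup = BeautifulSoup(html, "html.parser")

# Direct children of <p>: just the <a> tag
children = list(soup.p.children)
print([c.name for c in children])

# Descendants: <a>, then <span>, then the string 'Elsie'
descendants = list(soup.p.descendants)
print(descendants)
```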
(6) Getting the parent node
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
The program prints the p tag, i.e. the parent node of the a tag:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
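While .parent stops at the direct parent, the .parents generator keeps climbing all the way to the document root. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Direct parent of <a> is <p>
print(soup.a.parent.name)

# .parents climbs p -> body -> html -> the document itself
names = [t.name for t in soup.a.parents]
print(names)
```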
Closely related: .parents yields all ancestor nodes, and .next_sibling / .previous_sibling (plus their plural iterator forms) give the adjacent sibling nodes.
The tag selectors above are fast to process, but too limited for most real HTML-parsing needs, so BeautifulSoup also provides a set of query methods.
4. Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
finds elements in the document by tag name, attributes, or text content.
The examples below all use this test HTML:
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
(1) Finding by tag name (the name argument)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
This prints every ul tag:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
The calls can be nested further:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
print(ul.find_all('li'))
(2) Finding by attribute
Ordinary attributes can be passed either through attrs={} or as keyword arguments. Two caveats: name clashes with find_all's own name parameter, so it must go through attrs={}, and class is a Python keyword, so the keyword form is class_.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
(3) Selecting by text content (the text argument)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo']
This returns the matching strings rather than tags, so it is of limited use for locating elements; it is mostly useful for content matching.
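Besides an exact string, text= also accepts a compiled regular expression, which matches substrings as well. A minimal sketch on a made-up list:

```python
import re
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# A regex matches any string containing "Foo" -- including "Foobar"
hits = soup.find_all(text=re.compile("Foo"))
print(hits)
```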
find(name, attrs, recursive, text, **kwargs)
works like find_all, except that find returns only the first matching element (or None when nothing matches).
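The difference in return types matters in practice: find gives a Tag or None, while find_all always gives a list, possibly empty. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find(): first Tag, or None when nothing matches
print(soup.find("li").get_text())
print(soup.find("table"))

# find_all(): always a list, empty when nothing matches
print(soup.find_all("table"))
```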
find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() and find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first.
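The sibling methods above can be sketched on a small made-up list:

```python
from bs4 import BeautifulSoup

html = '<ul><li id="a">Foo</li><li id="b">Bar</li><li id="c">Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

# Siblings relative to the middle <li>
print(b.find_next_sibling().get_text())        # the <li> after it
print(b.find_previous_sibling().get_text())    # the <li> before it
print([t.get_text() for t in b.find_next_siblings()])
```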
5. CSS selectors
Pass a CSS selector straight to select() to make a selection:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
select() calls can also be nested, though it is rarely necessary: the space-separated descendant selector above already expresses nesting.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
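select() also understands richer CSS than the examples above, e.g. id selectors, child combinators, and attribute selectors (backed by Soup Sieve in recent BeautifulSoup versions). A small sketch:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# Child combinator with an id selector
print(soup.select("ul#list-1 > li.element"))

# Attribute selector with an exact value match
print(soup.select('li[class="element"]')[0].get_text())
```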
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
print(ul['id'])
Getting text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
print(li.get_text())
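Putting the two together covers a very common scraping pattern: pairing each element's text with one of its attributes. A minimal sketch on a couple of made-up links:

```python
from bs4 import BeautifulSoup

html = ('<p><a href="http://example.com/elsie" class="sister">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

# Pair each link's text with its href attribute
links = [(a.get_text(), a["href"]) for a in soup.select("a.sister")]
print(links)
```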
III. Summary
Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors are fast but offer only weak filtering.
Use find() / find_all() to match a single result or several.
If you are comfortable with CSS selectors, select() is the most convenient.
Memorize the common ways of reading attributes and text.
For more on BeautifulSoup, see the official documentation.
- End -
Author: a big-data engineer at Toutiao, writing about job hunting, interview experience, source code, Java, and big data.