Last time, in the article on regular expressions, we walked through a scraping exercise: extracting all the books, links, authors, and publication dates from the Douban front page. In that exercise we parsed the page source with regular expressions, which got fairly complex. That motivates today's topic, BeautifulSoup: a flexible and convenient web-page parsing library that is efficient and supports multiple parsers. With BeautifulSoup you can extract information from a page without writing a single regular expression.
I. Installing BeautifulSoup
pip install beautifulsoup4
II. Usage
1. Parsers
BeautifulSoup supports several parsers; you pick one when constructing the soup:

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | built into Python; moderate speed; tolerant of bad markup | versions before Python 2.7.3 / 3.2.2 handle malformed documents poorly
lxml HTML parser | BeautifulSoup(markup, "lxml") | very fast; tolerant of bad markup; the usual choice | requires the C library lxml
lxml XML parser | BeautifulSoup(markup, "xml") | very fast; the only XML parser | requires the C library lxml
html5lib | BeautifulSoup(markup, "html5lib") | best tolerance; parses documents the same way a browser does; produces valid HTML5 | very slow; an external Python dependency
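To see the tolerance differences in practice, here is a small sketch that feeds the same broken fragment to each parser. It is only an illustration: lxml and html5lib are optional installs, so the sketch skips them when they are absent.

```python
from bs4 import BeautifulSoup

# A fragment with unclosed tags; every parser must repair it somehow.
broken = "<p>Hello<b>world"

# html.parser ships with Python -- nothing extra to install.
fixed = str(BeautifulSoup(broken, "html.parser"))
print(fixed)  # the unclosed <b> and <p> get closing tags

# lxml and html5lib are optional third-party parsers;
# skip them gracefully when they are not installed.
for name in ("lxml", "html5lib"):
    try:
        print(name, BeautifulSoup(broken, name))
    except Exception as exc:
        print(name, "not available:", exc)
```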
2. Basic usage
Below is an incomplete HTML snippet: neither the body tag nor the html tag is closed.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Now parse the html above with the lxml parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
Below is the result after error-tolerant parsing has completed the missing tags, followed by the extracted title. Note that both the html and body tags have been closed:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
3. Tag selectors
(1) Selecting elements
Using the same html as above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
The output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Notice that only one p tag was printed, even though the HTML contains three.
This is a key property of tag selectors: when several tags share a name, only the first match is returned.
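The first-match behaviour is easy to verify with a trimmed-down stand-in for the page above (html.parser is used here only so the sketch needs no extra installs):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time ...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.p is shorthand for "the first <p> in the document"
first = soup.p
print(first["class"])     # the first <p> has class "title"

# find_all() (covered below) is the way to get every match
all_p = soup.find_all("p")
print(len(all_p))
```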
(2) Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Output:
dromouse
dromouse
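A few details worth knowing about attribute access, shown on a minimal made-up tag: .attrs returns the whole attribute dict, multi-valued attributes such as class come back as lists, and .get() is the safe lookup for attributes that may be missing.

```python
from bs4 import BeautifulSoup

html = '<p class="title story" name="dromouse"><b>text</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .attrs exposes every attribute as a dict
print(p.attrs)

# class is multi-valued in HTML, so it comes back as a list
print(p["class"])

# p['missing'] raises KeyError; .get() returns None instead
print(p.get("id"))
```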
(3) Getting text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
Output:
The Dormouse's story
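One caveat about .string, shown on a minimal example: it only returns a value when the tag has exactly one child string. When a tag mixes text with other tags, .string is None and get_text() is the tool to reach for.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time <b>three</b> sisters</p>"
soup = BeautifulSoup(html, "html.parser")

# Mixed children: .string gives None
print(soup.p.string)

# A single child string: .string works
print(soup.b.string)

# get_text() concatenates all descendant strings
print(soup.p.get_text())
```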
(4) Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
Output:
The Dormouse's story
(5) Getting children and descendants
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
.contents returns the tag's direct children as a list. (The output below comes from a variant of the snippet in which the story paragraph is the first <p> and link1 wraps <span>Elsie</span>.)
['\n Once upon a time there were three little sisters; and their names were\n ',
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>,
'\n'
, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
, '\n and\n '
, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
, '\n and they lived at the bottom of a well.\n ']
Another way to get the children is the .children iterator:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
print(i, child)
Output:
<list_iterator object at 0x1064f7dd8>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
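.children stops at direct children; the .descendants generator walks the whole subtree, yielding nested tags and text nodes too. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1"><span>Elsie</span></a></p>'
soup = BeautifulSoup(html, "html.parser")

# Direct children of <p>: just the <a> tag
children = list(soup.p.children)
print([c.name for c in children])

# Descendants: <a>, then <span>, then the string 'Elsie'
descendants = list(soup.p.descendants)
print(descendants)
```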
(6) Getting the parent node
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
The program prints the p tag, i.e. the parent node of the a tag:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
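While .parent stops at the direct parent, the .parents generator keeps climbing all the way to the document root. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Direct parent of <a> is <p>
print(soup.a.parent.name)

# .parents climbs p -> body -> html -> the document itself
names = [t.name for t in soup.a.parents]
print(names)
```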
Closely related: .parents yields all ancestor nodes, and .next_sibling / .previous_sibling (plus their plural iterator forms) give the adjacent sibling nodes.
The tag selectors above are fast to process, but too limited for most real HTML-parsing needs, so BeautifulSoup also provides a set of query methods.
4. Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
finds elements in the document by tag name, attributes, or text content.
The examples below all use this test HTML:
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
(1) Finding by tag name (the name argument)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
This prints every ul tag:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
The calls can be nested further:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
print(ul.find_all('li'))
(2) Finding by attribute
Ordinary attributes can be passed either through attrs={} or as keyword arguments. Two caveats: name clashes with find_all's own name parameter, so it must go through attrs={}, and class is a Python keyword, so the keyword form is class_.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
(3) Selecting by text content (the text argument)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo']
This returns the matching strings rather than tags, so it is of limited use for locating elements; it is mostly useful for content matching.
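Besides an exact string, text= also accepts a compiled regular expression, which matches substrings as well. A minimal sketch on a made-up list:

```python
import re
from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# A regex matches any string containing "Foo" -- including "Foobar"
hits = soup.find_all(text=re.compile("Foo"))
print(hits)
```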
find(name, attrs, recursive, text, **kwargs)
works like find_all, except that find returns only the first matching element (or None when nothing matches).
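The difference in return types matters in practice: find gives a Tag or None, while find_all always gives a list, possibly empty. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find(): first Tag, or None when nothing matches
print(soup.find("li").get_text())
print(soup.find("table"))

# find_all(): always a list, empty when nothing matches
print(soup.find_all("table"))
```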
find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() and find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first.
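The sibling methods above can be sketched on a small made-up list:

```python
from bs4 import BeautifulSoup

html = '<ul><li id="a">Foo</li><li id="b">Bar</li><li id="c">Jay</li></ul>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

# Siblings relative to the middle <li>
print(b.find_next_sibling().get_text())        # the <li> after it
print(b.find_previous_sibling().get_text())    # the <li> before it
print([t.get_text() for t in b.find_next_siblings()])
```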
5. CSS selectors
Pass a CSS selector straight to select() to make a selection:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
select() calls can also be nested, though it is rarely necessary: the space-separated descendant selector above already expresses nesting.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
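select() also understands richer CSS than the examples above, e.g. id selectors, child combinators, and attribute selectors (backed by Soup Sieve in recent BeautifulSoup versions). A small sketch:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# Child combinator with an id selector
print(soup.select("ul#list-1 > li.element"))

# Attribute selector with an exact value match
print(soup.select('li[class="element"]')[0].get_text())
```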
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
print(ul['id'])
Getting text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
print(li.get_text())
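Putting the two together covers a very common scraping pattern: pairing each element's text with one of its attributes. A minimal sketch on a couple of made-up links:

```python
from bs4 import BeautifulSoup

html = ('<p><a href="http://example.com/elsie" class="sister">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

# Pair each link's text with its href attribute
links = [(a.get_text(), a["href"]) for a in soup.select("a.sister")]
print(links)
```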
III. Summary
Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors are fast but offer only weak filtering.
Use find() / find_all() to match a single result or several.
If you are comfortable with CSS selectors, select() is the most convenient.
Memorize the common ways of reading attributes and text.
For more on BeautifulSoup, see the official documentation.
- End -
Author: a big-data engineer at Toutiao, writing about job hunting, interview experience, source code, Java, and big data.