Detailed Usage of the Python Scraping Libraries XPath, BeautifulSoup, re, and selenium

2021-12-29 今日在學


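Project code showcase

Part of the code and the project deployment method have been published on GitHub; see the GitHub repository [1].

Techniques used

Regular expression matching

The re module offers several matching functions: match, search, compile, findall, and finditer.

re.match(a, b, c) takes three arguments: the pattern, the string to match, and the flags. It only matches starting at the first position of the string. On success, the .span() method of the result returns the index range of the match; on failure, re.match returns None. On the result, .groups() returns a tuple containing the strings of all groups, and .group(num) returns the string captured by group num (numbering starts at 1); passing several group numbers returns a tuple.

re.search(a, b, c) returns the same kind of result, and groups are read the same way. The only difference is that search does not have to match from the start: if the string contains the pattern anywhere, it returns the first successful match.

re.sub(a, b, c, d, e) takes the pattern, the replacement, the string, the maximum number of replacements, and the flags. d defaults to 0, which means every match is replaced.

re.compile(a, b) compiles a regular expression (pattern and flags) so it can be reused, for example with match and search. When reading the matched text from a match result, call group(); the argument can be omitted or written as 0, and both return the whole match. Other argument values depend on the number of capture groups in your pattern. The start(), end(), and span() methods all return index positions of the matched text in the original string.

findall returns every substring that satisfies the pattern as a list, or an empty list if there is none. On a compiled pattern, findall(string, pos, endpos) takes the string, the start position, and the end position; the module-level re.findall(pattern, string, flags) does not take positions.

finditer(a, b, c) takes the pattern, the string, and the flags. It behaves like findall but returns an iterator of match objects, which you consume with for ... in.

re.split(a, b, c, d) splits the string wherever the pattern matches and returns the pieces as a list. The arguments are the pattern, the string, the maximum number of splits (default 0, meaning no limit), and the flags.

Regular expression flags

re.U interprets characters according to the Unicode character set. This flag affects \w, \W, \b, and \B.
re.X allows a more flexible layout so that the regular expression can be written in a more readable way.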
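The difference between match and search, and the group accessors described above, can be sketched as follows. The sample string is made up for illustration and does not come from the project.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "2021-12-29 hot_search"

# re.match succeeds only if the pattern matches at position 0.
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", text)
date_span = m.span()      # (start, end) of the whole match
whole = m.group()         # same as m.group(0): the whole match
year = m.group(1)         # first capture group, as a string
parts = m.groups()        # all capture groups, as a tuple of strings

# "hot" is not at the start of the string, so match returns None...
no_match = re.match(r"hot", text)
# ...while search scans forward and returns the first hit.
search_span = re.search(r"hot", text).span()

print(date_span, whole, year, parts, no_match, search_span)
```

Checking the result for None before calling group() or span() is the usual guard when the input is not known to match.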

See: Python Regular Expressions Explained (very detailed, a must-read!) [2]
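As a companion to the function list above, here is a minimal sketch of findall, finditer, sub, split, and the re.X flag. The input string and variable names are illustrative assumptions.

```python
import re

# Hypothetical input, chosen only for illustration.
text = "a1b22c333"

# Module-level findall: every non-overlapping match, as a list of strings.
all_numbers = re.findall(r"\d+", text)

# On a compiled pattern, findall also accepts pos and endpos.
digits = re.compile(r"\d+")
windowed = digits.findall(text, 2, 6)   # only looks at text[2:6] == "b22c"

# finditer yields match objects one by one.
spans = [m.span() for m in digits.finditer(text)]

# sub: count=0 (the default) replaces every match; count=1 only the first.
replace_all = re.sub(r"\d+", "#", text)
replace_one = re.sub(r"\d+", "#", text, count=1)

# split: maxsplit=0 (the default) means no limit.
pieces = re.split(r"\d+", text)
pieces_limited = re.split(r"\d+", text, maxsplit=1)

# re.X lets the pattern carry whitespace and comments.
verbose = re.compile(r"""
    ([a-z])   # one lowercase letter
    (\d+)     # followed by one or more digits
""", re.X)
pairs = verbose.findall(text)

print(all_numbers, windowed, spans, replace_all, replace_one, pieces, pieces_limited, pairs)
```

Note that split leaves a trailing empty string when the input ends with a match, and findall returns tuples once the pattern has more than one capture group.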

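XPath

XPath can be used to parse a local HTML file, or to parse an HTML string directly.

Common XPath rules

The common rules (//*, //name, /, .., @attr, contains(), and position predicates) are exercised in the demo code further down.

Local example

The code below uses a local sample file, ./index.html.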

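Parsing a local file:

# coding=utf-8
from lxml import etree

# Parse the local file with lxml's HTML parser
html = etree.parse('./index.html', etree.HTMLParser())
print(etree.tostring(html))

Parsing an HTML string:

# coding=utf-8
from lxml import etree

# Read the file as bytes and decode it into a string
with open('./index.html', 'rb') as fp:
    html = fp.read().decode('utf-8')
selector = etree.HTML(html)   # etree.HTML(source) builds a tree that XPath can query
print(selector)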
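Getting text

There are two ways to get text: /text() and //text(). The first returns only the text directly inside the selected node; the second also returns the text of all descendant nodes, including the whitespace characters produced by line breaks.

Attribute matching

When an attribute holds several values, use contains() to match one of them (multi-value attribute matching). To match on several attributes at once, combine the conditions with the and operator, optionally together with contains().

XPath operators and selecting by position

XPath has more than 100 built-in functions (last() and position() are the ones used for positional selection here); see XPath functions [3] for details.

Node axis selection

child::a/@href returns the href attribute values of the a elements that are children of the current node.
attribute::name returns the value of the named attribute of the current node.
attribute::* returns the values of all attributes of the current node.
child::node() returns all child nodes of the current node.
child::text() returns all text child nodes of the current element.
ancestor-or-self::li returns every li ancestor of the current element, including the element itself if it is an li.

See also: XPath axes [4], XPath pitfalls guide [5]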
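The text and attribute rules above can be tried without a local file. The sketch below parses an inline HTML string; the markup is a hypothetical stand-in for the article's ./index.html, which is not shown in the text.

```python
from lxml import etree

# Hypothetical sample markup (an assumption, not the article's index.html).
SAMPLE = """
<ul>
  <li class="item first"><a href="/one" title="1">One</a></li>
  <li class="item" lang="en">
    <a href="/two">Two <b>bold</b></a>
  </li>
</ul>
"""
root = etree.HTML(SAMPLE)

# /text(): only the text directly inside <a>, not the text inside <b>.
direct_text = root.xpath('//li[2]/a/text()')

# //text(): text of every descendant; the result may also contain
# whitespace-only strings that come from the line breaks in the markup.
all_text = root.xpath('//li[2]//text()')
meaningful = [t.strip() for t in all_text if t.strip()]

# An exact comparison fails on a multi-valued class attribute...
exact = root.xpath('//li[@class="first"]/a/text()')
# ...while contains() matches one of its values.
multi_value = root.xpath('//li[contains(@class, "first")]/a/text()')

# Several attribute conditions combined with "and".
both = root.xpath('//li[contains(@class, "item") and @lang="en"]/a/@href')

print(direct_text, meaningful, exact, multi_value, both)
```

contains() is a plain substring test, so contains(@class, "a") would also match a class such as "tab"; pick a value that is distinctive enough.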
XPath axes demo code

# coding=utf-8
from lxml import etree
# Alternative: read the file yourself and parse the decoded string
# fp = open('./index.html', 'rb')
# html = fp.read().decode('utf-8')   # or .decode('gbk')
# selector = etree.HTML(html)   # etree.HTML(source) builds a tree that XPath can query
# print(selector)

html = etree.parse('./index.html', etree.HTMLParser())
# print(etree.tostring(html).decode('utf-8'))

all_node = html.xpath('//*')  # all nodes: //*
part_node = html.xpath('//li')  # nodes by name: //name
child_node = html.xpath('//li/a')  # direct children
parent_node = html.xpath('//a[@href="//mr90.top"]/../@class')  # attribute of the parent node: ../@attr
attrs_node = html.xpath('//a[contains(@class,"a")]/text()')   # multi-value attribute matching with contains() (substring test)
# Selecting by position
first_node = html.xpath('//li[1]/a/text()')  # the first li
last_node = html.xpath('//li[last()]//text()')   # the last li
front_node = html.xpath('//li[position()<3]//text()')    # the first two li
end_node = html.xpath('//li[last()-2]//text()')   # the third li from the end

# Axes
child_node_z = html.xpath('//li[position()<2]/child::a/@href')  # href values of the a children of the current node
attribute_node = html.xpath('//li[2]//attribute::lang')  # lang attribute of the second li or of any of its descendants
all_child_node = html.xpath('//ul/li[last()-1]//child::*')  # all descendant elements of the second-to-last li
all_attrs_node = html.xpath('//li[1]/a/attribute::*')  # values of all attributes of the current node
all_child_text_node = html.xpath('//li[1]//child::text()')  # all text nodes under the first li
all_child_node_node = html.xpath('//li[1]/a/child::node()')  # all child nodes (elements and text) of the a
ancestor_self = html.xpath('//a[@title="1"]/../ancestor-or-self::li') # li ancestors of the a's parent, including the parent itself if it is an li
print(ancestor_self)
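Because ./index.html is not reproduced in the text, the demo above cannot be checked as-is. The following sketch runs a few of the same positional and axis expressions against an inline sample; the markup is a hypothetical assumption chosen to resemble what the demo queries.

```python
from lxml import etree

# Hypothetical markup (an assumption, not the article's index.html).
SAMPLE = (
    '<ul>'
    '<li class="item"><a href="/one" title="1" class="a">One</a></li>'
    '<li class="item" lang="en"><a href="/two">Two</a></li>'
    '<li class="item"><a href="/three">Three</a></li>'
    '</ul>'
)
root = etree.HTML(SAMPLE)

# Positional predicates
first_two = root.xpath('//li[position()<3]/a/text()')
third_from_last = root.xpath('//li[last()-2]/a/text()')

# Axes
child_href = root.xpath('//li[1]/child::a/@href')
lang = root.xpath('//li[2]/attribute::lang')
all_attrs = root.xpath('//li[1]/a/attribute::*')
child_nodes = root.xpath('//li[1]/a/child::node()')
child_text = root.xpath('//li[last()]/a/child::text()')

# <a> itself is not an li, so only its li ancestor is returned.
ancestors = root.xpath('//a[@title="1"]/ancestor-or-self::li')

print(first_two, third_from_last, child_href, lang, sorted(all_attrs))
print(child_nodes, child_text, [el.tag for el in ancestors])
```

With three li elements, last()-2 resolves to position 1, which is why the "third from last" query returns the first item here.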
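Using BeautifulSoup4

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.

Install it before use: pip install beautifulsoup4

Import it: from bs4 import BeautifulSoup

Getting content

# Assumed setup (not shown in the original): parse the same ./index.html
from bs4 import BeautifulSoup
with open('./index.html', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Get the title object
print(soup.title)  # <title>xPath方法</title>
# Get the title text
print(soup.title.string)  # a NavigableString, not an iterator
print(soup.title.text)
print(soup.title.get_text())
print(soup.find('title').get_text())
# print(soup.title.parent)   # the parent node, including its contents
print(soup.li.child)  # None: there is no .child attribute, so bs4 searches for a <child> tag
print(soup.li.children)  # an iterator over the direct children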
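Getting the first li tag

print(soup.li.get_text())  # matches the first li and returns the text of all nodes inside it
print(soup.find('li').text)
# Get the children of ul (whitespace between tags also counts as a child)
print(soup.ul.children)
for index, item in enumerate(soup.ul.children):
    print(index, item)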
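The snippets above assume a soup object built from ./index.html, which is not shown. A self-contained sketch of the same accessors, using a hypothetical inline document and the built-in html.parser (both assumptions), might look like this:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical document (an assumption, not the article's index.html).
SAMPLE = """<html><head><title>Demo page</title></head>
<body><ul>
<li><a href="/one" class="a item" id="a">One</a></li>
<li><a href="/two" class="item">Two</a></li>
</ul></body></html>"""
soup = BeautifulSoup(SAMPLE, "html.parser")

title_string = soup.title.string       # a NavigableString (a str subclass)
title_text = soup.title.get_text()     # a plain str
first_li_text = soup.li.get_text()     # text inside the first <li>

# .children is an iterator; the newlines between tags show up as children too.
ul_children = list(soup.ul.children)
tags_only = [child for child in ul_children if isinstance(child, Tag)]

print(repr(title_string), title_text, first_li_text)
print(len(ul_children), [tag.name for tag in tags_only])
```

Filtering with isinstance(child, Tag) is one simple way to skip the whitespace-only children when only elements matter.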
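Getting an element's attributes

Use element.attrs['name']. For multi-valued attributes such as class the result is a list; for other attributes it is a string. Chaining tag names (soup.tag.tag) does not move on to the second match: each access returns the first matching tag inside the previous result.

Getting multiple elements

find_all returns multiple elements. Add limit to cap the number of results. recursive=True (the default) searches all descendants; recursive=False searches only direct children.

Multi-level lookup

find_all returns a list; you can iterate over it and call find or find_all again on each item to reach nested elements.

Getting objects by attribute

id and class can both be used as filters. class is special: it is a Python keyword, so write class_ instead.

print(soup.find(id='a'))
print(soup.find('a', id='a'))
print(soup.find_all('a', id='a'))  # the result can be indexed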

# class is a keyword, so write class_ instead

print('class1', soup.find_all('a', class_='a'))
print('class2', soup.find_all('a', attrs={'class': 'item'}))  # more general
print('class3', soup.find_all('a', attrs={'class': 'item', 'id': 'a'}))  # several conditions
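To make the attribute and find_all behaviour above concrete, here is a small sketch on a hypothetical inline document (the markup and the html.parser choice are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical document (an assumption, not the article's index.html).
SAMPLE = """<ul id="menu">
<li><a href="/one" class="a item" id="a">One</a></li>
<li><a href="/two" class="item">Two</a></li>
<li><a href="/three" class="item">Three</a></li>
</ul>"""
soup = BeautifulSoup(SAMPLE, "html.parser")

class_value = soup.a.attrs["class"]    # multi-valued attribute: a list
href_value = soup.a["href"]            # ordinary attribute: a string
chained = soup.ul.li.a.get_text()      # each step takes the first match inside the previous one

limited = soup.find_all("a", limit=2)                   # at most two results
direct_a = soup.ul.find_all("a", recursive=False)       # <a> is not a direct child of <ul>
direct_li = soup.ul.find_all("li", recursive=False)     # <li> is

# Multi-level lookup: iterate over one result set and search again inside each item.
hrefs = [li.find("a")["href"] for li in soup.find_all("li")]

by_id = soup.find(id="a")
by_class = soup.find_all("a", class_="item")            # matches any one of a tag's classes
by_attrs = soup.find_all("a", attrs={"class": "item", "id": "a"})

print(class_value, href_value, chained, len(limited), len(direct_a), len(direct_li))
print(hrefs, by_id.get_text(), len(by_class), len(by_attrs))
```

class_="item" matches the first link as well, because its class attribute "a item" contains item as one of its values.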
Using a function as the filter

def judge_title1(t):
    # t is the class value, or None when the tag has no class attribute
    if t == 'a':
        return True

print(soup.find_all(class_=judge_title1))
# Filter by length
import re  # regular expressions
reg = re.compile("item")
def judge_title2(t):
    # True when t has length 5 and contains 'item'
    return len(str(t)) == 5 and bool(re.search(reg, t))
print(soup.find_all(class_=judge_title2))
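The function filters above can be checked on a small document. In this sketch the markup, function names, and parser are illustrative assumptions; the point is that the function receives the attribute value (or None), while a function passed as the first argument receives the whole tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document (an assumption, not the article's index.html).
SAMPLE = """<ul>
<li><a class="a" href="/one">One</a></li>
<li><a class="item1" href="/two">Two</a></li>
<li><a class="items-long" href="/three">Three</a></li>
<li><a href="/four">Four</a></li>
</ul>"""
soup = BeautifulSoup(SAMPLE, "html.parser")

def is_class_a(value):
    # value is None for tags without a class attribute
    return value == "a"

def five_chars_with_item(value):
    # Guard against None before measuring or searching the value
    return value is not None and len(value) == 5 and re.search("item", value) is not None

def has_href_but_no_class(tag):
    # Passed as the first argument, the function receives the Tag itself
    return tag.has_attr("href") and not tag.has_attr("class")

exact = [t.get_text() for t in soup.find_all(class_=is_class_a)]
five = [t.get_text() for t in soup.find_all(class_=five_chars_with_item)]
no_class = [t.get_text() for t in soup.find_all(has_href_but_no_class)]

print(exact, five, no_class)
```

Testing for None explicitly avoids relying on str(None) happening to have the wrong length.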
CSS selectors

CSS selectors are also supported: you can search by tag name, by attribute, by tag + class + id, or by a combination of these.

Further reading: Usage of the BeautifulSoup library in Python [6]; A very detailed guide to the Python Beautiful Soup library [7]; Python scraping: extracting text with BeautifulSoup in detail [8]

References
[1] GitHub repository: https://github.com/Rr210/hot_search
[2] Python Regular Expressions Explained (very detailed, a must-read!): https://blog.csdn.net/weixin_43347550/article/details/105158003
[3] XPath functions: http://www.w3school.com.cn/xpath/xpath_functions.asp
[4] XPath axes: https://www.w3school.com.cn/xpath/xpath_axes.asp
[5] XPath pitfalls guide: https://blog.csdn.net/Ryan_lee9410/article/details/107144213
[6] Usage of the BeautifulSoup library in Python: https://blog.csdn.net/qq_21933615/article/details/81171951
[7] A very detailed guide to the Python Beautiful Soup library: https://blog.csdn.net/love666666shen/article/details/77512353
[8] Python scraping: extracting text with BeautifulSoup in detail: https://blog.csdn.net/IT_arookie/article/details/82824620
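A minimal sketch of the CSS selector lookups listed above, using Beautiful Soup's select and select_one on a hypothetical inline document (the markup is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical document (an assumption, not the article's index.html).
SAMPLE = """<ul id="menu">
<li><a href="/one" class="a item" id="a">One</a></li>
<li><a href="/two" class="item">Two</a></li>
<li><a href="/three" class="item">Three</a></li>
</ul>"""
soup = BeautifulSoup(SAMPLE, "html.parser")

by_tag = soup.select("li")                   # by tag name
by_class = soup.select(".item")              # by class
by_id = soup.select_one("#a")                # by id; select_one returns the first match or None
by_attr = soup.select('a[href="/two"]')      # by attribute
combined = soup.select("a.item#a")           # tag + class + id
nested = soup.select("ul#menu > li > a")     # combination with child combinators

print(len(by_tag), len(by_class), by_id.get_text())
print([t.get_text() for t in by_attr], [t.get_text() for t in combined], len(nested))
```

select always returns a list, so an empty result is an empty list rather than None.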
