Last time I shared how to grab cookies by capturing browser traffic so that a crawler can get past a site's login check.
This article uses Chrome as an example to demonstrate how to read the cookies the browser stores on disk and use them in a crawler.
Chrome keeps its cookies in an SQLite database; the database file path is %LOCALAPPDATA%\Google\Chrome\User Data\Default\Cookies:
```
echo %LOCALAPPDATA%\Google\Chrome\User Data\Default\Cookies
C:\Users\refusea\AppData\Local\Google\Chrome\User Data\Default\Cookies
```
Let's use sqlite to take a look at the table schema:
```python
# -*- coding: utf-8 -*-
import os
import sqlite3

if __name__ == '__main__':
    cookie_file = os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data\\Default\\Cookies'
    conn = sqlite3.connect(cookie_file)
    cursor = conn.cursor()
    cursor.execute('select * from sqlite_master where type="table" and name="cookies"')
    for row in cursor:
        print(row)
    cursor.close()
    conn.close()
```
```sql
CREATE TABLE cookies(
    creation_utc INTEGER NOT NULL,
    host_key TEXT NOT NULL,
    name TEXT NOT NULL,
    value TEXT NOT NULL,
    path TEXT NOT NULL,
    expires_utc INTEGER NOT NULL,
    is_secure INTEGER NOT NULL,
    is_httponly INTEGER NOT NULL,
    last_access_utc INTEGER NOT NULL,
    has_expires INTEGER NOT NULL DEFAULT 1,
    is_persistent INTEGER NOT NULL DEFAULT 1,
    priority INTEGER NOT NULL DEFAULT 1,
    encrypted_value BLOB DEFAULT '',
    samesite INTEGER NOT NULL DEFAULT -1,
    source_scheme INTEGER NOT NULL DEFAULT 0,
    UNIQUE (host_key, name, path)
)
```
I couldn't find any documentation on what each column means, but comparing them with Chrome's own cookie viewer makes them easy enough to guess.
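As a quick sanity check, here is a minimal sketch of my own (not from the original write-up) that dumps a few rows, assuming creation_utc and expires_utc are microseconds since 1601-01-01 UTC (the Windows FILETIME epoch):

```python
# -*- coding: utf-8 -*-
# Sketch: peek at a few rows to confirm what the columns hold.
# Assumption: expires_utc is microseconds since 1601-01-01 UTC.
import os
import sqlite3
from datetime import datetime, timedelta

def chrome_time(microseconds):
    # Convert a Chrome/WebKit timestamp to a regular datetime.
    return datetime(1601, 1, 1) + timedelta(microseconds=microseconds)

if __name__ == '__main__':
    cookie_file = os.path.join(os.environ['LOCALAPPDATA'],
                               r'Google\Chrome\User Data\Default\Cookies')
    conn = sqlite3.connect(cookie_file)
    cursor = conn.cursor()
    cursor.execute('SELECT host_key, name, expires_utc FROM cookies LIMIT 5')
    for host_key, name, expires_utc in cursor:
        print(host_key, name, chrome_time(expires_utc))
    cursor.close()
    conn.close()
```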
Viewing the cookies
The following SQL retrieves all the cookies under the baidu.com domain:
select name, path, encrypted_value from cookies where host_key='.baidu.com'
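Running that query from Python shows that encrypted_value is a binary blob rather than readable text. Here is a small sketch of my own to print the first few bytes of each value:

```python
# -*- coding: utf-8 -*-
# Sketch: inspect the leading bytes of encrypted_value for each cookie.
import os
import sqlite3

if __name__ == '__main__':
    cookie_file = os.path.join(os.environ['LOCALAPPDATA'],
                               r'Google\Chrome\User Data\Default\Cookies')
    conn = sqlite3.connect(cookie_file)
    cursor = conn.cursor()
    cursor.execute("SELECT name, path, encrypted_value FROM cookies "
                   "WHERE host_key='.baidu.com'")
    for name, path, encrypted_value in cursor:
        # The leading bytes reveal which encryption scheme was used.
        print(name, path, encrypted_value[:4])
    cursor.close()
    conn.close()
```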
Note that the cookie value is not stored in plain text; it is encrypted and kept in the encrypted_value column.
The encryption scheme can be determined from the first few bytes of the ciphertext: a value starting with `\x01\x00\x00\x00` was encrypted with DPAPI, while a value starting with `v10` was encrypted with AES-GCM (used by newer versions of Chrome).
DPAPI is a data protection API built into Windows that applications can use to encrypt and decrypt data.
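For what it's worth, DPAPI can also be called through pywin32 instead of raw ctypes; a minimal sketch, assuming the pywin32 package is installed (the reference implementation further below sticks to ctypes):

```python
# -*- coding: utf-8 -*-
# Sketch: decrypt a DPAPI-protected blob via pywin32 (alternative to raw ctypes).
# Must run as the same Windows user that Chrome runs as.
import win32crypt

def dpapi_decrypt_pywin32(encrypted_value: bytes) -> bytes:
    # CryptUnprotectData returns a (description, decrypted_bytes) tuple.
    _, decrypted = win32crypt.CryptUnprotectData(encrypted_value, None, None, None, 0)
    return decrypted
```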
The AES key is stored in `%LOCALAPPDATA%\Google\Chrome\User Data\Local State`, which is a JSON file:
type "%LOCALAPPDATA%\Google\Chrome\User Data\Local State"
The key lives at the JSON path /os_crypt/encrypted_key. Its value is base64 encoded; after decoding, the first 5 bytes are the literal prefix DPAPI, and the rest is the AES key itself, encrypted with DPAPI.
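Putting that together, here is a sketch of my own that recovers the AES key and decrypts a single v10 value with the higher-level AESGCM helper from the cryptography package (it assumes pywin32 for the DPAPI step; the reference implementation below uses the lower-level Cipher API and ctypes instead):

```python
# -*- coding: utf-8 -*-
# Sketch: recover Chrome's AES key from Local State and decrypt one "v10" blob.
# A v10 blob is laid out as: b"v10" + 12-byte nonce + ciphertext + 16-byte GCM tag;
# AESGCM.decrypt() verifies and strips the tag for us.
import os
import json
import base64
import win32crypt
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def load_aes_key() -> bytes:
    state_file = os.path.join(os.environ['LOCALAPPDATA'],
                              r'Google\Chrome\User Data\Local State')
    with open(state_file, 'r', encoding='utf-8') as f:
        encoded_key = json.load(f)['os_crypt']['encrypted_key']
    encrypted_key = base64.b64decode(encoded_key)[5:]   # drop the "DPAPI" prefix
    return win32crypt.CryptUnprotectData(encrypted_key, None, None, None, 0)[1]

def decrypt_v10(encrypted_value: bytes, key: bytes) -> str:
    nonce, payload = encrypted_value[3:15], encrypted_value[15:]
    return AESGCM(key).decrypt(nonce, payload, None).decode()
```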
Below is a Python implementation that reads cookies from the local Chrome profile, for reference:
```python
# -*- coding: utf-8 -*-
import sqlite3
import os
import json
import base64
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def dpapi_decrypt(encrypted_value):
    """Decrypt a blob protected with Windows DPAPI (CryptUnprotectData)."""
    import ctypes
    import ctypes.wintypes

    class DATA_BLOB(ctypes.Structure):
        _fields_ = [('cbData', ctypes.wintypes.DWORD),
                    ('pbData', ctypes.POINTER(ctypes.c_char))]

    p = ctypes.create_string_buffer(encrypted_value, len(encrypted_value))
    blobin = DATA_BLOB(ctypes.sizeof(p), p)
    blobout = DATA_BLOB()
    retval = ctypes.windll.crypt32.CryptUnprotectData(
        ctypes.byref(blobin), None, None, None, None, 0, ctypes.byref(blobout))
    if not retval:
        raise ctypes.WinError()
    result = ctypes.string_at(blobout.pbData, blobout.cbData)
    ctypes.windll.kernel32.LocalFree(blobout.pbData)
    return result


def aes_decrypt(encrypted_value):
    """Decrypt a 'v10' value: AES-GCM with the key stored in Local State."""
    with open(os.path.join(os.environ['LOCALAPPDATA'],
                           r"Google\Chrome\User Data\Local State"),
              mode="r", encoding='utf-8') as f:
        jsn = json.loads(str(f.readline()))
    encoded_key = jsn["os_crypt"]["encrypted_key"]
    encrypted_key = base64.b64decode(encoded_key.encode())
    encrypted_key = encrypted_key[5:]          # strip the 'DPAPI' prefix
    key = dpapi_decrypt(encrypted_key)         # the real AES key
    nonce = encrypted_value[3:15]              # 12-byte GCM nonce after 'v10'
    cipher = Cipher(algorithms.AES(key), modes.GCM(nonce), backend=default_backend())
    decryptor = cipher.decryptor()
    return decryptor.update(encrypted_value[15:])


def decrypt(encrypted_value):
    try:
        if encrypted_value[:4] == b'\x01\x00\x00\x00':
            # Old-style cookie: the whole value is DPAPI-encrypted.
            value = dpapi_decrypt(encrypted_value)
            return value.decode()
        elif encrypted_value[:3] == b'v10':
            # New-style cookie: AES-GCM; drop the 16-byte auth tag at the end.
            value = aes_decrypt(encrypted_value)
            return value[:-16].decode()
    except WindowsError:
        return None


def get_cookie_from_chrome(domain):
    file = os.path.join(os.environ['USERPROFILE'],
                        r'AppData\Local\Google\Chrome\User Data\Default\Cookies')
    conn = sqlite3.connect(file)
    cursor = conn.cursor()
    sql = "SELECT name, encrypted_value FROM cookies WHERE host_key='{}'".format(domain)
    cursor.execute(sql)
    cookie = ''
    for row in cursor:
        value = row[1]
        if value is not None:
            name = row[0]
            value = decrypt(value)
            if value is not None:
                cookie += name + '=' + value + ';'
    cursor.close()
    conn.close()
    return cookie


if __name__ == '__main__':
    print(get_cookie_from_chrome('.baidu.com'))
```
How to use it? Again taking Gitee as the example:
```python
# -*- coding: utf-8 -*-
import requests
from chrome_cookie import get_cookie_from_chrome   # the module shown above

session = requests.session()

if __name__ == '__main__':
    headers = {
        'Host': 'gitee.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
        'Accept': 'application/json',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
    }
    url = 'https://gitee.com/api/v3/internal/my_resources'
    # First request goes out without a Cookie header.
    result = session.get(url=url, headers=headers)
    # Attach the cookie read from the local Chrome profile and try again.
    headers['Cookie'] = get_cookie_from_chrome('.gitee.com')
    result = session.get(url=url, headers=headers)
    if result.status_code == 200:
        print('success: \n%s' % (result.text))
    else:
        print(result.status_code)
```
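Instead of stuffing everything into a Cookie header, you can also load the pairs into the session's cookie jar; a small variant of my own, reusing the same get_cookie_from_chrome helper:

```python
# -*- coding: utf-8 -*-
# Sketch: feed the cookies into the requests session instead of a raw header.
import requests
from chrome_cookie import get_cookie_from_chrome

session = requests.session()

for pair in get_cookie_from_chrome('.gitee.com').split(';'):
    if not pair:
        continue
    name, _, value = pair.partition('=')   # split on the first '=' only
    session.cookies.set(name, value, domain='.gitee.com')

result = session.get('https://gitee.com/api/v3/internal/my_resources',
                     headers={'Accept': 'application/json'})
print(result.status_code)
```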
OK, crawling just got a lot easier: no matter how tough the captcha is, it can't stop me any more... just log in once in the browser first. Convenient, isn't it?