How to Build a "Google Search" System with Python (Code Included)

2020-12-11 AI科技大本營

Source | hackernoon

Translated by | 武明利

Editor | Carol

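In this article, I will show you how to build your own answer-finding system with Python. Essentially, this automation finds the answer to a multiple-choice question from a picture.

Let's be clear about one thing: it is impossible to search the Internet for questions during an exam, but I can quickly take a picture when the examiner turns around. That is the first part of the algorithm. I then need a way to extract the question from the picture.

Many services seem to offer text-extraction tools, but I needed some kind of API to solve this problem. In the end, Google's Vision API was exactly the tool I was looking for. The great thing is that the first 1,000 API calls per month are free, which is enough for me to test and use the API.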

Vision AI

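First, create a Google Cloud account, then search for Vision AI among the services. With Vision AI, you can perform tasks such as assigning labels to images to organize them, getting recommended crop vertices, detecting famous landmarks or places, and extracting text.

Check the documentation to enable and set up the API. After configuration, you have to create a JSON file containing your key and download it to your computer.

Run the following command to install the client library: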

pip install google-cloud-vision

Then provide authentication credentials to your application code by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS.

import os, io
from google.cloud import vision
from google.cloud.vision import types

# JSON file that contains your key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your_private_key.json'

# Instantiates a client
client = vision.ImageAnnotatorClient()

FILE_NAME = 'your_image_file.jpg'

# Loads the image into memory
with io.open(os.path.join(FILE_NAME), 'rb') as image_file:
    content = image_file.read()

# Note: google-cloud-vision 2.0+ removed the types module; use vision.Image there
image = vision.types.Image(content=content)

# Performs text detection on the image file
response = client.text_detection(image=image)
print(response)

# Extract description
texts = response.text_annotations[0]
print(texts.description)

When you run the code, you will see a JSON-formatted response that includes the specifications of the detected text. We only need the plain description, so I extracted that part from the response.

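Searching for the Question on Google

The next step is to search for the question part on Google to get some information. I used the regular expression (regex) library to extract the question part from the description (the response). Then we have to slugify the extracted question so that it can be used in a search.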

import re
import urllib.parse

# If ending with question mark
if '?' in texts.description:
    question = re.search('([^?]+)', texts.description).group(1)
# If ending with colon
elif ':' in texts.description:
    question = re.search('([^:]+)', texts.description).group(1)
# If ending with newline
elif '\n' in texts.description:
    question = re.search('([^\n]+)', texts.description).group(1)

# Slugify the match
slugify_keyword = urllib.parse.quote_plus(question)
print(slugify_keyword)
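The extraction logic above can also be wrapped in a small function so it can be tried without calling the Vision API. This is only a sketch: the function name extract_question, the fallback to the whole text, and the sample string are assumptions, and it splits on the first delimiter instead of using the regexes from the original code.

```python
from urllib.parse import quote_plus


def extract_question(text):
    """Return the text before the first '?', ':' or newline, checked in that order."""
    for delimiter in ("?", ":", "\n"):
        if delimiter in text:
            # Keep everything up to the first occurrence of the delimiter.
            return text.split(delimiter, 1)[0].strip()
    # Assumption: with no delimiter at all, treat the whole text as the question.
    return text.strip()


# Hypothetical OCR output for a multiple-choice question.
sample = "Which planet is known as the Red Planet?\nA) Venus\nB) Mars\nC) Jupiter"
question = extract_question(sample)
print(question)
print(quote_plus(question))
```

Keeping the step in a function makes it easy to check against a few sample strings before spending API calls on real images.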

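Crawling Information

We will use BeautifulSoup to crawl the first 3 results to get some information about the question, because the answer is probably in one of them.

Also, if you want to crawl specific data from Google's search listing, don't use Inspect Element to find the elements' attributes. Print the whole page and read the attributes from there instead, because they differ from what the browser shows.

We need to crawl the first 3 links in the search results, but these links are really messed up, so it is important to get clean links for crawling.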

/url?q=https://en.wikipedia.org/wiki/IAU_definition_of_planet&sa=U&ved=2ahUKEwiSmtrEsaTnAhXtwsQBHduCCO4QFjAAegQIBBAB&usg=AOvVaw0HzMKrBxdHZj5u1Yq1t0en

As you can see, the actual link is located between q= and &sa. Using a regular expression, we can extract this specific field, which is the valid URL.
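As an alternative to the regex, the same field can be read with the standard library's URL parser, which also decodes percent-encoded characters. This is a sketch rather than the author's method; the function name clean_google_link is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def clean_google_link(href):
    """Return the target URL from a '/url?q=...' redirect link, or None."""
    parsed = urlparse(href)
    # Only redirect-style links carry the real target in the q parameter.
    if parsed.path != "/url":
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


href = ("/url?q=https://en.wikipedia.org/wiki/IAU_definition_of_planet"
        "&sa=U&ved=2ahUKEwiSmtrEsaTnAhXtwsQBHduCCO4QFjAAegQIBBAB"
        "&usg=AOvVaw0HzMKrBxdHZj5u1Yq1t0en")
print(clean_google_link(href))
```

Parsing the query string avoids depending on &sa always following q=, which the regex approach assumes.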

import re
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

result_urls = []

def crawl_result_urls():
    req = Request('https://google.com/search?q=' + slugify_keyword,
                  headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    results = bs.find_all('div', class_='ZINbbc')
    try:
        for result in results:
            link = result.find('a')['href']
            # Checking if it is url (in case)
            if 'url' in link:
                result_urls.append(re.search('q=(.*)&sa', link).group(1))
    # TypeError covers a result block that has no <a> tag
    except (AttributeError, IndexError, TypeError) as e:
        pass

Before we crawl the content of these URLs, let me show you the question answering system in Python.

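Question Answering System

This is the main part of the algorithm. After crawling information from the first 3 results, the program should detect the answer by iterating over the documents. At first, I thought it would be best to use a similarity algorithm to detect the document most similar to the question, but I had no idea how to implement it.

After a few hours of research, I found an article on Medium that explains a question answering system in Python. It has an easy-to-use Python package that lets you implement a QA system on your own private data.

Let's install the package first: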

pip install cdqa

I am using the download functions included in the example code block below to manually download the pre-trained model and data:

import pandas as pd
from ast import literal_eval

from cdqa.utils.filters import filter_paragraphs
from cdqa.utils.download import download_model, download_bnpp_data
from cdqa.pipeline.cdqa_sklearn import QAPipeline

# Download data and models
download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
download_model(model='bert-squad_1.1', dir='./models')

# Loading data and filtering / preprocessing the documents
df = pd.read_csv('data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv',
                 converters={'paragraphs': literal_eval})
df = filter_paragraphs(df)

# Loading QAPipeline with CPU version of BERT Reader pretrained on SQuAD 1.1
cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')

# Fitting the retriever to the list of documents in the dataframe
cdqa_pipeline.fit_retriever(df)

# Sending a question to the pipeline and getting prediction
query = 'Since when does the Excellence Program of BNP Paribas exist?'
prediction = cdqa_pipeline.predict(query)

print('query: {}\n'.format(query))
print('answer: {}\n'.format(prediction[0]))
print('title: {}\n'.format(prediction[1]))
print('paragraph: {}\n'.format(prediction[2]))

Its output should look like this:

It prints the exact answer and the paragraph that contains it.

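Basically, when the question is extracted from the picture and sent to the system, the retriever selects from the crawled data a list of documents that are most likely to contain the answer. As mentioned earlier, it computes the cosine similarity between the question and each document in the crawled data.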
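To make the retrieval step concrete, here is a minimal sketch of ranking documents by TF-IDF cosine similarity using only the standard library. It is not cdQA's actual retriever: the helper names (tokenize, tfidf_vectors, cosine, rank_documents), the smoothed IDF formula, and the sample sentences are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Number of texts in which each term appears at least once.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return [
        {term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
         for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(question, documents):
    """Return (index, score) pairs sorted from most to least similar."""
    # Assumption: the question is vectorized together with the documents.
    vectors = tfidf_vectors(documents + [question])
    question_vec = vectors[-1]
    scored = [(i, cosine(question_vec, vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical crawled snippets, one per document.
documents = [
    "Venus is the second planet from the Sun.",
    "Mars is known as the Red Planet.",
    "Jupiter is the largest planet in the Solar System.",
]
ranking = rank_documents("Which planet is known as the Red Planet", documents)
for index, score in ranking:
    print(index, round(score, 3))
```

With these sample sentences, the Mars sentence shares the most weighted terms with the question and should come first in the ranking.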

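After selecting the most probable documents, the system divides each document into paragraphs and sends them together with the question to the reader, which is basically a pre-trained deep learning model. The model used is a PyTorch version of the well-known NLP model BERT.

The reader then outputs the most probable answer it can find in each paragraph. After the reader, the final layer of the system compares the answers using an internal scoring function and outputs the most likely one according to the scores, which is the answer to our question.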
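The final selection step amounts to picking the highest-scoring candidate. The sketch below only illustrates that idea: the candidate dictionaries, their field names, and the scores are made up, and cdQA's real scoring function is internal to the library and not reproduced here.

```python
def best_answer(candidates):
    """Return the candidate with the highest score, or None if there are none."""
    return max(candidates, key=lambda c: c["score"], default=None)


# Hypothetical reader outputs: one candidate answer per paragraph.
candidates = [
    {"answer": "Venus", "title": "venus", "score": 0.12},
    {"answer": "Mars", "title": "mars", "score": 0.87},
    {"answer": "Jupiter", "title": "jupiter", "score": 0.05},
]
best = best_answer(candidates)
print(best["answer"], "from", best["title"])
```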

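Below is the schema of the system's mechanism.

You have to set up your dataframe (CSV) in a specific structure so that it can be sent to the cdQA pipeline.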
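As a rough illustration of that structure, the sketch below writes and reads back a two-column CSV with title and paragraphs, where paragraphs is a list of strings stored as text. The column layout is an assumption inferred from the earlier read_csv call with converters={'paragraphs': literal_eval}; the sample rows are invented, and the standard csv module stands in for pandas.

```python
import csv
import io
from ast import literal_eval

# Hypothetical documents: one row per crawled page.
rows = [
    {"title": "mars", "paragraphs": ["Mars is the fourth planet from the Sun.",
                                     "It is often called the Red Planet."]},
    {"title": "venus", "paragraphs": ["Venus is the second planet from the Sun."]},
]

# Write the rows to an in-memory CSV, storing each paragraph list as its repr.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "paragraphs"])
writer.writeheader()
for row in rows:
    writer.writerow({"title": row["title"], "paragraphs": repr(row["paragraphs"])})

# Read it back, converting the paragraphs column into real lists again.
buffer.seek(0)
loaded = [
    {"title": r["title"], "paragraphs": literal_eval(r["paragraphs"])}
    for r in csv.DictReader(buffer)
]
print(loaded[0]["paragraphs"][1])
```

The literal_eval step mirrors the converter used when loading the sample dataset, which is why the paragraphs column has to contain a valid Python list literal.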

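In practice, however, I used the PDF converter to create an input dataframe from a directory of PDF files. So I am going to save all the crawled data of each result in a pdf file. We expect to end up with 3 pdf files in total (it could also be 1 or 2). We also need to name these pdf files, which is why I crawl the heading of each page.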
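Page headings can contain slashes, question marks, or other characters that are awkward in file names, so it may help to normalize the title before building the path. The helper below is a hypothetical sketch, not part of the original code; the name safe_pdf_path, the 80-character limit, and the "untitled" fallback are assumptions.

```python
import os
import re


def safe_pdf_path(directory, title, max_length=80):
    """Build a .pdf path from a page title using only filesystem-friendly characters."""
    # Lowercase, then collapse every run of other characters into one underscore.
    slug = re.sub(r"[^a-z0-9-]+", "_", title.strip().lower()).strip("_")
    # Fall back to a generic name if nothing usable is left.
    slug = slug[:max_length] or "untitled"
    return os.path.join(directory, slug + ".pdf")


print(safe_pdf_path("pdfs", "  IAU definition of planet? / Wikipedia "))
```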

import os, errno, re
import urllib.error
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from cdqa.pipeline.cdqa_sklearn import QAPipeline
from cdqa.utils.converters import pdf_converter

def get_result_details(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(req).read()
        bs = BeautifulSoup(html, 'html.parser')
        try:
            # Crawl any heading in result to name pdf file
            title = bs.find(re.compile('^h[1-6]$')).get_text().strip().replace('?', '').lower()
            # Naming the pdf file
            filename = "/home/coderasha/autoans/pdfs/" + title + ".pdf"
            if not os.path.exists(os.path.dirname(filename)):
                try:
                    os.makedirs(os.path.dirname(filename))
                except OSError as exc:  # Guard against race condition
                    if exc.errno != errno.EEXIST:
                        raise
            with open(filename, 'w') as f:
                # Crawl first 5 paragraphs
                for line in bs.find_all('p')[:5]:
                    f.write(line.text + '\n')
        except AttributeError:
            pass
    except urllib.error.HTTPError:
        pass

def find_answer():
    df = pdf_converter(directory_path='/home/coderasha/autoans/pdfs')
    cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')
    cdqa_pipeline.fit_retriever(df)
    # question is the global extracted from the image earlier
    query = question + '?'
    prediction = cdqa_pipeline.predict(query)
    print('query: {}\n'.format(query))
    print('answer: {}\n'.format(prediction[0]))
    print('title: {}\n'.format(prediction[1]))
    print('paragraph: {}\n'.format(prediction[2]))
    return prediction[0]

Let me summarize the algorithm: it extracts the question from the picture, searches for it on Google, crawls the first 3 results, creates 3 pdf files from the crawled data, and finally finds the answer using the question answering system.

If you want to see how it works, check out the bot I made that can solve exam questions from a picture.

Here is the full code:

import os, io
import errno
import urllib
import urllib.error
import urllib.parse
import urllib.request
import hashlib
import re
import requests
from time import sleep
from google.cloud import vision
from google.cloud.vision import types
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd
from ast import literal_eval
from cdqa.utils.filters import filter_paragraphs
from cdqa.utils.download import download_model, download_bnpp_data
from cdqa.pipeline.cdqa_sklearn import QAPipeline
from cdqa.utils.converters import pdf_converter

result_urls = []

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your_private_key.json'

client = vision.ImageAnnotatorClient()

FILE_NAME = 'your_image_file.jpg'

with io.open(os.path.join(FILE_NAME), 'rb') as image_file:
    content = image_file.read()

# Note: google-cloud-vision 2.0+ removed the types module; use vision.Image there
image = vision.types.Image(content=content)

response = client.text_detection(image=image)

texts = response.text_annotations[0]
# print(texts.description)

if '?' in texts.description:
    question = re.search('([^?]+)', texts.description).group(1)
elif ':' in texts.description:
    question = re.search('([^:]+)', texts.description).group(1)
elif '\n' in texts.description:
    question = re.search('([^\n]+)', texts.description).group(1)

slugify_keyword = urllib.parse.quote_plus(question)
# print(slugify_keyword)

def crawl_result_urls():
    req = Request('https://google.com/search?q=' + slugify_keyword,
                  headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    results = bs.find_all('div', class_='ZINbbc')
    try:
        for result in results:
            link = result.find('a')['href']
            print(link)
            if 'url' in link:
                result_urls.append(re.search('q=(.*)&sa', link).group(1))
    # TypeError covers a result block that has no <a> tag
    except (AttributeError, IndexError, TypeError) as e:
        pass

def get_result_details(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(req).read()
        bs = BeautifulSoup(html, 'html.parser')
        try:
            title = bs.find(re.compile('^h[1-6]$')).get_text().strip().replace('?', '').lower()
            # Set your path to pdf directory
            filename = "/path/to/pdf_folder/" + title + ".pdf"
            if not os.path.exists(os.path.dirname(filename)):
                try:
                    os.makedirs(os.path.dirname(filename))
                except OSError as exc:
                    if exc.errno != errno.EEXIST:
                        raise
            with open(filename, 'w') as f:
                for line in bs.find_all('p')[:5]:
                    f.write(line.text + '\n')
        except AttributeError:
            pass
    except urllib.error.HTTPError:
        pass

def find_answer():
    # Set your path to pdf directory
    df = pdf_converter(directory_path='/path/to/pdf_folder/')
    cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')
    cdqa_pipeline.fit_retriever(df)
    query = question + '?'
    prediction = cdqa_pipeline.predict(query)

    # print('query: {}\n'.format(query))
    # print('answer: {}\n'.format(prediction[0]))
    # print('title: {}\n'.format(prediction[1]))
    # print('paragraph: {}\n'.format(prediction[2]))
    return prediction[0]

crawl_result_urls()

for url in result_urls[:3]:
    get_result_details(url)
    sleep(5)

answer = find_answer()
print('Answer: ' + answer)

Sometimes it may get confused, but I think overall it is okay. At least I can pass the exam with 60% correct answers.

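Developers are welcome to tell me what you think in the comments! Ideally the system would go through all the questions at once, but I didn't have enough time for that, so I will have to continue next time.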

(*This article was compiled by AI科技大本營. For reprints, please contact WeChat 1092722531.)

