Building a Search Engine with BERT and TensorFlow

2021-01-14 相約機器人

Author | Denis Antyukhov

Source | Medium

Editor | 代碼醫生團隊

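Feature extractors based on neural probabilistic language models, such as BERT, extract features that are relevant to a wide range of downstream NLP tasks. For that reason they are sometimes called natural language understanding (NLU) modules.

These features can also be used for instance-based learning, which relies on computing the similarity between a query and the training samples. To demonstrate this, we will build a nearest-neighbour search engine for text, using BERT for feature extraction.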

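The plan for this experiment is:

Get a pre-trained BERT model checkpoint

Extract a sub-graph optimized for inference

Create a feature extractor with tf.Estimator

Explore the vector space with T-SNE and the Embedding Projector

Implement a nearest-neighbour search engine

Accelerate nearest-neighbour queries with math

Example: build a movie recommendation system

Questions and answers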

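What is in this guide?

This guide contains two implementations: a BERT text feature extractor and a nearest-neighbour search engine.

Who is this guide for?

This guide should be useful for researchers interested in using BERT for natural language understanding tasks. It can also serve as a worked example of interfacing with the tf.Estimator API.

What does it take?

For a reader familiar with TensorFlow, it should take about 30 minutes to finish this guide.

Related code

The code for this experiment is available in Colab. Also check out the repository set up for these BERT experiments: it contains bonus content.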

https://colab.research.google.com/drive/1ra7zPFnB2nWtoAc0U5bLp0rWuPWb6vu4

https://github.com/gaphex/bert_experimental

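Step 1: Get the pre-trained model

We start with a pre-trained BERT checkpoint. For demonstration purposes, we will use the uncased English model pre-trained by Google engineers.

To configure and optimize the graph for inference, we will use the excellent bert-as-a-service repository. This repository allows serving BERT models to remote clients over TCP.

Having a remote BERT server is beneficial in multi-host environments. In this part of the experiment, however, we will focus on creating a local (in-process) feature extractor. This is useful if you want to avoid the additional latency and potential failure modes introduced by a client-server architecture.

Now download the model and install the package.

!wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
!unzip wwm_uncased_L-24_H-1024_A-16.zip
!pip install bert-serving-server --no-deps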

Step 2: Optimize the inference graph
Normally, modifying a model graph requires some low-level TensorFlow programming. Thanks to bert-as-a-service, however, we can configure the inference graph through a simple CLI-style interface.
import os
import tensorflow as tf

from bert_serving.server.graph import optimize_graph
from bert_serving.server.helper import get_args_parser

MODEL_DIR = '/content/wwm_uncased_L-24_H-1024_A-16/' #@param {type:"string"}
GRAPH_DIR = '/content/graph/' #@param {type:"string"}
GRAPH_OUT = 'extractor.pbtxt' #@param {type:"string"}
GPU_MFRAC = 0.2 #@param {type:"string"}

POOL_STRAT = 'REDUCE_MEAN' #@param {type:"string"}
POOL_LAYER = "-2" #@param {type:"string"}
SEQ_LEN = "64" #@param {type:"string"}

tf.gfile.MkDir(GRAPH_DIR)

# Build the same argument namespace the bert-serving CLI would produce.
parser = get_args_parser()
carg = parser.parse_args(args=['-model_dir', MODEL_DIR,
                               "-graph_tmp_dir", GRAPH_DIR,
                               '-max_seq_len', str(SEQ_LEN),
                               '-pooling_layer', str(POOL_LAYER),
                               '-pooling_strategy', POOL_STRAT,
                               '-gpu_memory_fraction', str(GPU_MFRAC)])

# optimize_graph returns the path of a temporary file holding the optimized graph.
tmpfi_name, config = optimize_graph(carg)
graph_fout = os.path.join(GRAPH_DIR, GRAPH_OUT)

tf.gfile.Rename(
    tmpfi_name,
    graph_fout,
    overwrite=True
)
print("Serialized graph to {}".format(graph_fout))

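There are a few parameters to note.
 
For each text sample, the BERT encoding layers output a tensor of shape [sequence_len, encoder_dim], with one vector per token. To obtain a fixed-size representation, some form of pooling has to be applied.
 
The POOL_STRAT parameter defines the pooling strategy applied to encoder layer number POOL_LAYER. The default value 'REDUCE_MEAN' averages the vectors of all tokens in the sequence. This strategy works best for most sentence-level tasks when the model has not been fine-tuned. Another option is NONE, in which case no pooling is applied at all. This is useful for word-level tasks such as named entity recognition or POS tagging. For a detailed discussion of these options, see Han Xiao's blog post.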
https://hanxiao.github.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/
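
To make the pooling step concrete, here is a minimal NumPy sketch of mean pooling over one encoder layer's output. It is an illustration under stated assumptions, not the bert-as-a-service implementation: the function name masked_mean_pool and the toy shapes are made up for this example, and the input mask is used so that padding positions do not contribute to the average.

```python
import numpy as np

def masked_mean_pool(token_vectors, input_mask):
    """Average token vectors over the sequence axis, ignoring padding.

    token_vectors: float array, shape [batch, sequence_len, encoder_dim]
    input_mask:    0/1 array,   shape [batch, sequence_len] (1 = real token)
    """
    mask = input_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid division by zero
    return summed / counts

# Toy batch: 2 samples, 4 token positions, 3-dimensional encoder output.
tokens = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])
pooled = masked_mean_pool(tokens, mask)
print(pooled.shape)  # (2, 3)
```

Whatever the sequence length, the result has one fixed-size vector per sample, which is what the nearest-neighbour search later in the guide relies on.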
 
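SEQ_LEN sets the maximum length of the sequences processed by the model. Smaller values increase inference speed almost linearly.
 
Running the command above serializes the model graph and weights into a GraphDef object and writes it to the pbtxt file named by GRAPH_OUT. The serialized file is usually smaller than the pre-trained checkpoint because the nodes and variables needed only for training are removed. The result is a very portable solution: for example, the English model takes only 380 MB after serialization.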
 
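Step 3: Create the feature extractor
Now we will use the serialized graph to build a feature extractor with the tf.Estimator API. Two things need to be defined: input_fn and model_fn.
 
input_fn manages getting the data into the model. This includes running the whole text preprocessing pipeline and preparing a feed_dict for BERT.
 
First, each text sample is converted into a tf.Example instance containing the necessary features listed in INPUT_NAMES. The bert_tokenizer object holds the WordPiece vocabulary and performs the text preprocessing. After that, the examples are regrouped by feature name in feed_dict.
import logging
import numpy as np
# Module paths assume the layout of the bert-serving-server package installed in Step 1.
from bert_serving.server.bert.tokenization import FullTokenizer
from bert_serving.server.bert.extract_features import convert_lst_to_features

log = logging.getLogger('tensorflow')
# Assumption: the vocabulary file shipped inside the checkpoint directory.
VOCAB_PATH = os.path.join(MODEL_DIR, 'vocab.txt')

INPUT_NAMES = ['input_ids', 'input_mask', 'input_type_ids']
bert_tokenizer = FullTokenizer(VOCAB_PATH)

def build_feed_dict(texts):
    # SEQ_LEN was defined as a string form field in Step 2, so cast it to int here.
    text_features = list(convert_lst_to_features(
        texts, int(SEQ_LEN), int(SEQ_LEN),
        bert_tokenizer, log, False, False))

    target_shape = (len(texts), -1)

    feed_dict = {}
    for iname in INPUT_NAMES:
        features_i = np.array([getattr(f, iname) for f in text_features])
        features_i = features_i.reshape(target_shape)
        features_i = features_i.astype("int32")
        feed_dict[iname] = features_i

    return feed_dict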
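
To show what the three inputs look like, the following toy sketch builds them for a single sentence. It is only an illustration: TOY_VOCAB and toy_features are hypothetical, whitespace splitting stands in for real WordPiece tokenization, and the actual work is done by convert_lst_to_features.

```python
# Hypothetical toy vocabulary; the real WordPiece vocabulary ships with the checkpoint.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "world": 5}

def toy_features(text, seq_len):
    """Build BERT-style inputs for one sentence using whitespace tokenization."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = ["[CLS]"] + text.lower().split()[:seq_len - 2] + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)          # 1 marks a real token
    pad = seq_len - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * pad    # pad up to the fixed length
    input_mask += [0] * pad                    # 0 marks padding
    input_type_ids = [0] * seq_len             # single sentence: all segment 0
    return {"input_ids": input_ids,
            "input_mask": input_mask,
            "input_type_ids": input_type_ids}

features = toy_features("Hello world again", seq_len=8)
print(features["input_ids"])   # [2, 4, 5, 1, 3, 0, 0, 0]
print(features["input_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every sample ends up with the same length, which is why the arrays in feed_dict can be stacked into rectangular int32 matrices.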

tf.Estimator has a quirk: it rebuilds and reinitializes the whole computational graph on every call to the predict function. To avoid that overhead, we pass a generator to predict, and the generator feeds features to the model in a never-ending loop.
def build_input_fn(container):

    def gen():
        # Endless loop: every iteration re-reads whatever the container currently holds.
        while True:
            try:
                yield build_feed_dict(container.get())
            except StopIteration:
                yield build_feed_dict(container.get())

    def input_fn():
        return tf.data.Dataset.from_generator(
            gen,
            output_types={iname: tf.int32 for iname in INPUT_NAMES},
            output_shapes={iname: (None, None) for iname in INPUT_NAMES})
    return input_fn

class DataContainer:
    def __init__(self):
        self._texts = None

    def set(self, texts):
        # Accept a single string as a one-element batch.
        if type(texts) is str:
            texts = [texts]
        self._texts = texts

    def get(self):
        return self._texts
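
The generator pattern can be seen in isolation, without TensorFlow. In the sketch below, fake_predict is a hypothetical stand-in for Estimator.predict and Container mirrors DataContainer; the point is only that the expensive setup runs once, while each next() call processes whatever the container holds at that moment.

```python
class Container:
    """Mutable holder that the generator reads on every iteration."""
    def __init__(self):
        self._texts = None

    def set(self, texts):
        self._texts = [texts] if isinstance(texts, str) else texts

    def get(self):
        return self._texts

def endless_batches(container):
    # Never terminates: each next() yields the container's current contents.
    while True:
        yield container.get()

def fake_predict(batches):
    # Hypothetical stand-in for Estimator.predict: the setup happens once, here.
    print("building graph (runs once)")
    for batch in batches:
        yield [len(text) for text in batch]

container = Container()
predict_fn = fake_predict(endless_batches(container))

container.set(["a", "bb"])
first = next(predict_fn)    # triggers the one-time setup, then yields [1, 2]
container.set("ccc")
second = next(predict_fn)   # no setup this time, yields [3]
```

Because both generators are lazy, nothing is consumed ahead of time in this sketch, so the container can be refilled safely between calls.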

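model_fn contains the specification of the model. In our case, it is loaded from the pbtxt file saved in the previous step. The features are mapped explicitly to the corresponding input nodes through input_map.
GRAPH_PATH = os.path.join(GRAPH_DIR, GRAPH_OUT)  # file written in Step 2

def model_fn(features, mode):
    # Load the serialized GraphDef from disk.
    with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    # Wire the estimator's features into the graph's input nodes.
    output = tf.import_graph_def(graph_def,
                                 input_map={k + ':0': features[k]
                                            for k in INPUT_NAMES},
                                 return_elements=['final_encodes:0'])

    return tf.estimator.EstimatorSpec(mode=mode, predictions={'output': output[0]})

estimator = tf.estimator.Estimator(model_fn=model_fn)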

We now have almost everything needed to run inference. Let's get to work!
def batch(iterable, n=1):
    # Yield successive slices of at most n items.
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

def build_vectorizer(_estimator, _input_fn_builder, batch_size=128):
    container = DataContainer()
    # predict() returns a generator, so the graph is built only once.
    predict_fn = _estimator.predict(_input_fn_builder(container), yield_single_examples=False)

    def vectorize(text, verbose=False):
        x = []
        bar = tf.keras.utils.Progbar(len(text))
        for text_batch in batch(text, batch_size):
            container.set(text_batch)
            x.append(next(predict_fn)['output'])
            if verbose:
                bar.add(len(text_batch))

        r = np.vstack(x)
        return r

    return vectorize

A standalone version of the feature extractor described above can be found in the repository.
https://github.com/gaphex/bert_experimental
 
>>> bert_vectorizer = build_vectorizer(estimator, build_input_fn)
>>> bert_vectorizer(64 * ['sample text']).shape
(64, 1024)

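Step 4: Explore the vector space with the Projector
Now it's time for a demonstration!
 
Using the vectorizer, we will generate embeddings for articles from the Reuters-21578 benchmark corpus.
To visualize and explore the embedding vector space in 3D, we will use a dimensionality reduction technique called T-SNE.
 
Let's get the article embeddings first.
import nltk
from nltk.corpus import reuters

nltk.download("reuters")
nltk.download("punkt")

max_samples = 256
categories = ['wheat', 'tea', 'strategic-metal',
              'housing', 'money-supply', 'fuel']

S, X, Y = [], [], []

for category in categories:
    print(category)

    # Join tokenized sentences back into strings and cap the sample count.
    sents = reuters.sents(categories=category)
    sents = [' '.join(sent) for sent in sents][:max_samples]
    X.append(bert_vectorizer(sents, verbose=True))
    Y += [category] * len(sents)
    S += sents

X = np.vstack(X)
X.shape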

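An interactive visualization of the generated embeddings is available in the Embedding Projector.
 
You can run T-SNE yourself or load a checkpoint using the bookmark in the lower right corner (loading works only in Chrome).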
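
The article does not show how the embeddings were exported, so the following is only one possible sketch. It assumes the Embedding Projector's TSV upload layout: one tab-separated vector per line with no header, plus a single-column metadata file with one label per line. The function name export_for_projector and the stand-in data are hypothetical; in the notebook the inputs would be X and Y from the cell above.

```python
import os
import tempfile
import numpy as np

def export_for_projector(vectors, labels, out_dir):
    """Write vectors.tsv and metadata.tsv for upload to the Embedding Projector."""
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    # One embedding per line, tab-separated, no header row.
    np.savetxt(vec_path, vectors, delimiter="\t", fmt="%.6f")
    # Single-column metadata: one label per line, no header row.
    with open(meta_path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(str(label) + "\n")
    return vec_path, meta_path

# Stand-in data so the sketch runs on its own.
demo_X = np.random.RandomState(0).rand(5, 8)
demo_Y = ["wheat", "tea", "fuel", "tea", "wheat"]
vec_path, meta_path = export_for_projector(demo_X, demo_Y, tempfile.mkdtemp())
```

The two files can then be loaded through the Projector's load dialog, with the labels available for colouring points by category.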
 
 
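Step 5: Build the search engine
Now, suppose we have a knowledge base of 50k text samples and need to answer queries against this data quickly. How do we retrieve the sample most similar to a query from a text database? The answer is nearest-neighbour search.
 
Formally, the search problem is defined as follows:
Given a set of points S in a vector space M and a query point Q ∈ M, find the point in S closest to Q. There are many ways to define "closest" in a vector space; we will use Euclidean distance.
 
So, to build a search engine for text, we will follow these steps:
 
Vectorize all samples from the knowledge base, which gives S
Vectorize the query, which gives Q
Compute the Euclidean distances D between Q and S
Sort D in ascending order, which gives the indices of the most similar samples
Retrieve the labels of those samples from the knowledge base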

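To keep the implementation simple, we will write it in pure TensorFlow.
 
First, create placeholders for Q and S
dim = 1024
graph = tf.Graph()
sess = tf.InteractiveSession(graph=graph)

Q = tf.placeholder("float", [dim])
S = tf.placeholder("float", [None, dim])

Define the Euclidean distance computation
squared_distance = tf.reduce_sum(tf.pow(Q - S, 2), axis=1)
distance = tf.sqrt(squared_distance)

Finally, get the indices of the most similar samples
top_k = 3

# top_k returns the largest values, so negate the distances to get the smallest ones.
top_neg_dists, top_indices = tf.math.top_k(tf.negative(distance), k=top_k)
top_dists = tf.negative(top_neg_dists)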
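
For readers who want to check the logic without a TensorFlow session, the same three operations can be sketched in NumPy. This is an illustrative equivalent rather than the article's code; nearest_neighbours and the toy data are made up for the example.

```python
import numpy as np

def nearest_neighbours(S, Q, top_k=3):
    """Return indices and Euclidean distances of the top_k rows of S closest to Q."""
    distances = np.sqrt(np.sum((S - Q) ** 2, axis=1))
    top_indices = np.argsort(distances)[:top_k]   # ascending order: closest first
    return top_indices, distances[top_indices]

# Toy knowledge base of four 2-D points and one query.
S_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
Q_demo = np.array([0.0, 1.0])
idx, dist = nearest_neighbours(S_demo, Q_demo, top_k=3)
print(idx)   # [0 1 2]
print(dist)  # distances of roughly 1.0, 1.414 and 2.0
```

Sorting the whole distance array is fine at this scale; the TensorFlow version uses top_k instead, which avoids a full sort.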

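Step 6: Accelerate the search with math
Now that the basic retrieval algorithm is set up, the question is:
can we make it run faster? With a little math, we can.
 
For a pair of vectors p and q, the Euclidean distance is defined as follows:
 
$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$
 
This is exactly how we computed it in Step 5.
 
However, since p and q are vectors, we can expand and rewrite this:
 
$$d(p, q)^2 = \langle p, p \rangle - 2 \langle p, q \rangle + \langle q, q \rangle$$
 
where ⟨...⟩ denotes the inner product.
 
In TensorFlow, this can be written as follows:
Q = tf.placeholder("float", [dim])
S = tf.placeholder("float", [None, dim])

Qr = tf.reshape(Q, (1, -1))

PP = tf.keras.backend.batch_dot(S, S, axes=1)  # <p, p> for every row of S
QQ = tf.matmul(Qr, tf.transpose(Qr))           # <q, q>
PQ = tf.matmul(S, tf.transpose(Qr))            # <p, q> for every row of S

distance = PP - 2 * PQ + QQ
distance = tf.sqrt(tf.reshape(distance, (-1,)))

top_neg_dists, top_indices = tf.math.top_k(tf.negative(distance), k=top_k)

Since matrix multiplication is highly optimized, this implementation runs slightly faster than the previous one.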
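
As a sanity check, the expanded form can be compared numerically with the direct definition. The sketch below uses NumPy and random data; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.RandomState(42)
S_check = rng.randn(100, 16)   # knowledge base: 100 vectors of dimension 16
Q_check = rng.randn(16)        # a single query vector

# Direct form: subtract, square, sum, take the root.
direct = np.sqrt(np.sum((S_check - Q_check) ** 2, axis=1))

# Expanded form: <p, p> - 2 <p, q> + <q, q>, built from dot products only.
PP = np.sum(S_check * S_check, axis=1)   # squared norm of every row of S
PQ = S_check @ Q_check                   # inner product of every row with Q
QQ = Q_check @ Q_check                   # squared norm of the query
# Clip tiny negative values caused by floating-point rounding before the root.
expanded = np.sqrt(np.maximum(PP - 2.0 * PQ + QQ, 0.0))

print(np.allclose(direct, expanded))  # True
```

The clipping step is a precaution added in this sketch: when a query coincides with a stored vector, rounding can push the expanded squared distance slightly below zero, and the square root would then return NaN.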
 
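By the way, in the formula above PP and QQ are actually the squared L2 norms of the respective vectors. If both vectors are L2-normalized, then PP = QQ = 1. This gives an interesting relation between the inner product and the Euclidean distance:
 
$$d(p, q)^2 = 2 - 2 \langle p, q \rangle$$
 
However, L2 normalization discards information about vector magnitude, which is undesirable in many cases.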
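
The relation for normalized vectors is easy to confirm numerically. The following NumPy sketch uses two random vectors; all names are local to the example.

```python
import numpy as np

rng = np.random.RandomState(0)
p = rng.randn(32)
q = rng.randn(32)

# L2-normalize both vectors so that <p, p> = <q, q> = 1.
p_unit = p / np.linalg.norm(p)
q_unit = q / np.linalg.norm(q)

squared_distance = np.sum((p_unit - q_unit) ** 2)
inner_product = p_unit @ q_unit

print(np.isclose(squared_distance, 2.0 - 2.0 * inner_product))  # True
```

One consequence is that, for normalized vectors, ranking by ascending Euclidean distance gives the same order as ranking by descending inner product.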
 
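Instead, notice that as long as the knowledge base does not change, PP, its squared vector norms, stays constant as well. So rather than recomputing it every time, we can precompute it once and reuse the result, which speeds up the distance computation further.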
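
The caching idea can be sketched in a few lines of NumPy. CachedNormIndex is a hypothetical helper written for this illustration, not part of the article's code: it computes the squared norms once in the constructor and reuses them for every query.

```python
import numpy as np

class CachedNormIndex:
    """Hypothetical helper: stores the knowledge base and its squared norms once."""

    def __init__(self, kbase):
        self.kbase = np.asarray(kbase, dtype=np.float64)
        # Computed a single time; valid for as long as kbase is unchanged.
        self.sq_norms = np.sum(self.kbase ** 2, axis=1)

    def query(self, q, top_k=3):
        q = np.asarray(q, dtype=np.float64)
        # Only the cross term and the query norm are computed per query.
        sq_dist = self.sq_norms - 2.0 * (self.kbase @ q) + q @ q
        dist = np.sqrt(np.maximum(sq_dist, 0.0))   # guard against rounding
        order = np.argsort(dist)[:top_k]
        return order, dist[order]

kb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
index = CachedNormIndex(kb)
top_idx, top_dist = index.query([0.0, 1.0], top_k=2)
print(top_idx)  # [0 1]
```

If the knowledge base is modified, the cached norms must be recomputed, which is the trade-off this optimization accepts.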
 
Now let's put it all together.
class L2Retriever:
    def __init__(self, dim, top_k=3, use_norm=False, use_gpu=True):
        self.dim = dim
        self.top_k = top_k
        self.use_norm = use_norm
        config = tf.ConfigProto(
            device_count={'GPU': (1 if use_gpu else 0)}
        )
        config.gpu_options.allow_growth = True
        self.session = tf.Session(config=config)

        self.norm = None
        self.query = tf.placeholder("float", [self.dim])
        self.kbase = tf.placeholder("float", [None, self.dim])

        self.build_graph()

    def build_graph(self):
        # With use_norm=True, the squared norms of the knowledge base are fed in precomputed.
        if self.use_norm:
            self.norm = tf.placeholder("float", [None, 1])

        distance = dot_l2_distances(self.kbase, self.query, self.norm)
        top_neg_dists, top_indices = tf.math.top_k(tf.negative(distance), k=self.top_k)
        top_dists = tf.negative(top_neg_dists)

        self.top_distances = top_dists
        self.top_indices = top_indices

    def predict(self, kbase, query, norm=None):
        query = np.squeeze(query)
        feed_dict = {self.query: query, self.kbase: kbase}
        if self.use_norm:
            feed_dict[self.norm] = norm

        I, D = self.session.run([self.top_indices, self.top_distances],
                                feed_dict=feed_dict)

        return I, D

def dot_l2_distances(kbase, query, norm=None):
    query = tf.reshape(query, (1, -1))

    # XX holds the squared norms of the knowledge base rows.
    if norm is None:
        XX = tf.keras.backend.batch_dot(kbase, kbase, axes=1)
    else:
        XX = norm
    YY = tf.matmul(query, tf.transpose(query))
    XY = tf.matmul(kbase, tf.transpose(query))

    distance = XX - 2 * XY + YY
    distance = tf.sqrt(tf.reshape(distance, (-1,)))

    return distance

Example: movie recommendation system
For this example we will use a dataset of movie summaries from IMDB. Using the NLU and Retriever modules, we will build a movie recommendation system that suggests movies with similar plot features.
 
First, download and prepare the IMDB dataset.
http://www.cs.cmu.edu/~ark/personas/
 
import pandas as pd
import json

!wget http:
!tar -xvzf MovieSummaries.tar.gz

plots_df = pd.read_csv('MovieSummaries/plot_summaries.txt', sep='\t', header=None)
meta_df = pd.read_csv('MovieSummaries/movie.metadata.tsv', sep='\t', header=None)

plot = {}
metadata = {}
movie_data = {}

for movie_id, movie_plot in plots_df.values:
    plot[movie_id] = movie_plot

for movie_id, movie_name, movie_genre in meta_df[[0,2,8]].values:
    genre = list(json.loads(movie_genre).values())
    if len(genre):
        metadata[movie_id] = {"name": movie_name,
                              "genre": genre}

# Keep only movies that have both a plot summary and metadata.
for movie_id in set(plot.keys())&set(metadata.keys()):
    movie_data[metadata[movie_id]['name']] = {"genre": metadata[movie_id]['genre'],
                                              "plot": plot[movie_id]}

X, Y, names = [], [], []

for movie_name, movie_meta in movie_data.items():
    X.append(movie_meta['plot'])
    Y.append(movie_meta['genre'])
    names.append(movie_name)

Vectorize the movie plots with the BERT NLU module:
X_vect = bert_vectorizer(X, verbose=True)
 
Finally, using L2Retriever, find the movies whose plot vectors are most similar to that of the query movie, and return them to the user.
def buildMovieRecommender(movie_names, vectorized_plots, top_k=10):
    retriever = L2Retriever(vectorized_plots.shape[1], use_norm=True, top_k=top_k, use_gpu=False)
    # Squared norms of the knowledge base, computed once and reused for every query.
    vectorized_norm = np.sum(vectorized_plots**2, axis=1).reshape((-1,1))

    def recommend(query):
        try:
            # The closest match is the query movie itself, so skip the first index.
            idx = retriever.predict(vectorized_plots,
                                    vectorized_plots[movie_names.index(query)],
                                    vectorized_norm)[0][1:]
            for i in idx:
                print(movie_names[i])
        except ValueError:
            print("{} not found in movie db. Suggestions:".format(query))
            for i, name in enumerate(movie_names):
                if query.lower() in name.lower():
                    print(i, name)

    return recommend

Let's check it out!
 
>>> recommend = buildMovieRecommender(names, X_vect)
>>> recommend("The Matrix")
Impostor
Immortel
Saturn 3
Terminator Salvation
The Terminator
Logan's Run
Genesis II
Tron: Legacy
Blade Runner

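Even without supervision, the model performs adequately on several classification and retrieval tasks. Although performance can be improved further with supervised data, the described approach to text feature extraction provides a solid baseline for downstream NLP solutions.
 
This concludes the guide to building a search engine with BERT and TensorFlow.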
