Tutorial | Fast corpus search in Python: approximate nearest neighbors

2021-03-04 機器之心

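Recently I have been experimenting with adding and subtracting GloVe word embeddings. For example, we can take the embedding vector for "king", subtract the vector for "man", and add the vector for "woman" to get a result vector. Then, given the corpus these embeddings belong to, we can search for the embedding most similar to the result and retrieve the corresponding word. Running such a query gives: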

King + (Woman - Man) = Queen
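A minimal sketch of this arithmetic, using made-up 3-dimensional vectors instead of real GloVe embeddings. The values in `toy` and the helper names `cosine` and `nearest_word` are assumptions for illustration only:

```python
import math

# Toy 3-d "embeddings" with made-up values; real GloVe vectors have 50+ dimensions.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_word(vec, embeddings, exclude=()):
    """Return the word whose embedding is most similar to vec."""
    candidates = (w for w in embeddings if w not in exclude)
    return max(candidates, key=lambda w: cosine(vec, embeddings[w]))

# king - man + woman, computed component by component
result = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# The input words are excluded so the query cannot simply return itself.
answer = nearest_word(result, toy, exclude=("king", "man", "woman"))
print(answer)
```

With these toy values the result vector lands on the "queen" vector, so that is the word returned. Real embeddings are noisier, and the nearest word is not guaranteed to be the expected one.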

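There are many ways to run a nearest neighbor query over a corpus of word-embedding pairs. The one way that is guaranteed to find the optimal vector is to iterate over the whole corpus and compare every pair against the query, which is slow and not recommended. A better technique is a vectorized cosine distance, as shown below: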

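import numpy as np

# One row per embedding in the corpus
vectors = np.array(embeddingmodel.embeddings)
# Dot product with the query, scaled by each row's norm. The query's own norm is
# the same for every row, so leaving it out does not change the ordering.
ranks = np.dot(query, vectors.T) / np.sqrt(np.sum(vectors**2, 1))
# Corpus indices ordered from most to least similar
mostSimilar = list(ranks.argsort()[::-1])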

To learn more about cosine distance, see this article: http://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html
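One detail of the vectorized snippet is worth spelling out: it divides by the norms of the corpus vectors only, not by the norm of the query. Writing $q$ for the query and $v_i$ for the $i$-th corpus vector (notation introduced here, not in the original post), and assuming $q$ is not the zero vector:

```latex
\cos(q, v_i) = \frac{q \cdot v_i}{\lVert q \rVert \, \lVert v_i \rVert},
\qquad
\mathrm{rank}_i = \frac{q \cdot v_i}{\lVert v_i \rVert} = \lVert q \rVert \, \cos(q, v_i).
```

Because $\lVert q \rVert$ is the same positive constant for every $i$, sorting by $\mathrm{rank}_i$ gives the same order as sorting by cosine similarity. The values in `ranks` are therefore scaled similarities rather than true cosines.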

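Vectorized cosine distance is much faster than the iterative approach, but it may still be too slow. This is where approximate nearest neighbor search comes in: it returns approximate results quickly. Often you do not need the exact best result. Take a question such as: what is a synonym for "Queen"? In cases like this you only need a good-enough answer fast, and that is what approximate nearest neighbor search is for.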
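To give a feel for where the speed-up comes from, here is a toy, single-tree sketch of the random-hyperplane splitting idea that Annoy's documentation describes. It is not Annoy's implementation, and the names `side`, `build_tree`, and `candidates` are made up for this illustration:

```python
import random

def side(vec, normal, offset):
    """True if vec lies on the 'left' side of the hyperplane normal . x = offset."""
    return sum(n * x for n, x in zip(normal, vec)) >= offset

def build_tree(ids, vectors, rng, leaf_size=4):
    """Recursively split item ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # Sample two items; the splitting hyperplane is equidistant from both.
    a, b = rng.sample(ids, 2)
    normal = [x - y for x, y in zip(vectors[a], vectors[b])]
    midpoint = [(x + y) / 2 for x, y in zip(vectors[a], vectors[b])]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left, right = [], []
    for i in ids:
        (left if side(vectors[i], normal, offset) else right).append(i)
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, vectors, rng, leaf_size),
        "right": build_tree(right, vectors, rng, leaf_size),
    }

def candidates(tree, query):
    """Walk down to a single leaf; only its items need to be compared to the query."""
    while "leaf" not in tree:
        branch = "left" if side(query, tree["normal"], tree["offset"]) else "right"
        tree = tree[branch]
    return tree["leaf"]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
tree = build_tree(list(range(200)), vectors, rng)
leaf = candidates(tree, vectors[17])
print(len(leaf), "candidates instead of", len(vectors))
```

A query only scans the handful of items in one leaf instead of all 200. The price is that a true neighbor sitting just across a split can be missed, which is why a library would build many trees and merge their candidates.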

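In this article we will walk through a simple Python script that quickly finds approximate nearest neighbors. The Python libraries we will use are Annoy and lmdb. For my corpus I will use word-embedding pairs, but the instructions apply to any kind of embedding: song embeddings for a music recommendation engine, or image embeddings for reverse image search.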

Making an index

Let's create a Python script named "make_annoy_index.py". First we include the dependencies we need:

'''
Usage: python2 make_annoy_index.py \
    --embeddings=<embedding path> \
    --num_trees=<int> \
    --verbose

Generate an Annoy index and lmdb map given an embedding file

Embedding file can be
  1. A .bin file that is compatible with word2vec binary formats.
     There are pre-trained vectors to download at https://code.google.com/p/word2vec/
  2. A .gz file with the GloVe format (item then a list of floats in plaintext)
  3. A plain text file with the same format as above
'''
import annoy
import lmdb
import os
import sys
import argparse

from vector_utils import get_vectors

The important part of the last line is "vector_utils". We will write "vector_utils" later, so don't worry about it for now.

Next, let's flesh out the script by adding the "create_index" function. This is where we generate the lmdb map and the Annoy index.

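1. First, find the length of the embeddings, which is needed to instantiate the Annoy index.

2. Next, instantiate an lmdb map with "env = lmdb.open(fn_lmdb, map_size=int(1e9))".

3. Make sure there is no existing Annoy index or lmdb map in the current path.

4. Add every key and vector in the embedding file to the lmdb map and the Annoy index.

5. Build and save the Annoy index.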
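The lmdb map in step 4 stores the word-to-id and id-to-word directions in the same key-value store, told apart by a one-character prefix. A small sketch of that scheme, with a plain dict standing in for lmdb (the function name `build_mapping` is made up here):

```python
def build_mapping(words):
    """Store both lookup directions in one key-value store, separated by a prefix."""
    store = {}  # stand-in for the lmdb map
    for i, word in enumerate(words):
        id_key = "i%d" % i     # the 'i' prefix marks an id key
        word_key = "w" + word  # the 'w' prefix marks a word key
        store[id_key] = word_key
        store[word_key] = id_key
    return store

store = build_mapping(["the", "king", "queen"])

# Look up in either direction, then strip the one-character prefix.
king_id = int(store["wking"][1:])
word_at_2 = store["i2"][1:]
print(king_id, word_at_2)
```

The prefixes keep the two key spaces apart, so a vocabulary word that happens to look like an id key, such as "i2", is stored as "wi2" and cannot overwrite the id entry.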

'''
function create_index(fn, num_trees=30, verbose=False)
Creates an Annoy index and lmdb map given an embedding file fn
Input:
    fn - filename of the embedding file
    num_trees - number of trees to build Annoy index with
    verbose - log status
Return:
    Void
'''
def create_index(fn, num_trees=30, verbose=False):
    fn_annoy = fn + '.annoy'
    fn_lmdb = fn + '.lmdb' # stores word <-> id mapping

    # Read the first vector to find the embedding length
    word, vec = next(get_vectors(fn))
    size = len(vec)
    if verbose:
        print("Vector size: {}".format(size))

    env = lmdb.open(fn_lmdb, map_size=int(1e9))
    if not os.path.exists(fn_annoy) or not os.path.exists(fn_lmdb):
        i = 0
        a = annoy.AnnoyIndex(size)
        with env.begin(write=True) as txn:
            for word, vec in get_vectors(fn):
                a.add_item(i, vec)
                # Prefixes keep id keys ('i...') and word keys ('w...') apart
                id = 'i%d' % i
                word = 'w' + word
                txn.put(id, word)
                txn.put(word, id)
                i += 1
                if verbose:
                    if i % 1000 == 0:
                        print("{} ...".format(i))
        if verbose:
            print("Starting to build")
        a.build(num_trees)
        if verbose:
            print("Finished building")
        a.save(fn_annoy)
        if verbose:
            print("Annoy index saved to: {}".format(fn_annoy))
            print("lmdb map saved to: {}".format(fn_lmdb))
    else:
        print("Annoy index and lmdb map already in path")

I have also added argparse, so that we can invoke the script from the command line:

'''
private function _create_args()
Creates an argparse object for CLI for create_index() function
Input:
    Void
Return:
    args object with required arguments for create_index() function
'''
def _create_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--embeddings", help="filename of the embeddings", type=str, required=True)
    # Default matches create_index(), so omitting the flag does not pass None to a.build()
    parser.add_argument("--num_trees", help="number of trees to build index with", type=int, default=30)
    parser.add_argument("--verbose", help="print logging", action="store_true")
    args = parser.parse_args()
    return args
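When experimenting, it can help to check how the flags are parsed without going through the shell. A small sketch using argparse's ability to parse an explicit list of arguments. The helper name `build_parser` is made up here, and the default of 30 trees simply mirrors the default in create_index:

```python
import argparse

def build_parser():
    """Parser with the same three flags the indexing script accepts."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--embeddings", type=str, required=True)
    parser.add_argument("--num_trees", type=int, default=30)
    parser.add_argument("--verbose", action="store_true")
    return parser

# Passing a list to parse_args() bypasses sys.argv, which is handy for quick checks.
args = build_parser().parse_args(
    ["--embeddings=glove.6B.50d.txt", "--num_trees=10", "--verbose"]
)
defaults = build_parser().parse_args(["--embeddings=glove.6B.50d.txt"])
print(args.num_trees, args.verbose, defaults.num_trees, defaults.verbose)
```

The second call shows what happens when optional flags are left out: the tree count falls back to its default and the verbose flag is off.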

Add a main function to invoke the script, which completes make_annoy_index.py:

if __name__ == '__main__':
    args = _create_args()
    create_index(args.embeddings, num_trees=args.num_trees, verbose=args.verbose)

Now we can invoke the new script from the command line to generate an Annoy index and the corresponding lmdb map!

python2 make_annoy_index.py \
    --embeddings=<embedding path> \
    --num_trees=<int> \
    --verbose

Writing vector utils

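In make_annoy_index.py we imported the Python script vector_utils. Now it is time to write it. vector_utils helps read vectors from .txt, .bin, and .pkl files.

Writing this script is not very relevant to what we are doing here, so I have included the whole script below: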

'''
Vector Utils
Utils to read in vectors from txt, .bin, or .pkl.
Taken from Erik Bernhardsson
Source: https://github.com/erikbern/ann-presentation/blob/master/util.py
'''
import gzip
import struct
import cPickle

def _get_vectors(fn):
    # Open gzipped files transparently and dispatch on the inner extension
    if fn.endswith('.gz'):
        f = gzip.open(fn)
        fn = fn[:-3]
    else:
        f = open(fn)
    if fn.endswith('.bin'): # word2vec format
        # Header line: vocabulary size and vector length
        words, size = (int(x) for x in f.readline().strip().split())
        t = 'f' * size
        while True:
            pos = f.tell()
            buf = f.read(1024)
            if buf == '' or buf == '\n': return
            # Each record is the word, a space, then `size` 4-byte floats
            i = buf.index(' ')
            word = buf[:i]
            f.seek(pos + i + 1)
            vec = struct.unpack(t, f.read(4 * size))
            yield word.lower(), vec

    elif fn.endswith('.txt'): # Assume simple text format
        for line in f:
            items = line.strip().split()
            yield items[0], [float(x) for x in items[1:]]

    elif fn.endswith('.pkl'): # Assume pickle (MNIST)
        i = 0
        for pics, labels in cPickle.load(f):
            for pic in pics:
                yield i, pic
                i += 1

def get_vectors(fn, n=float('inf')):
    # Yield at most n (key, vector) pairs from the file
    i = 0
    for line in _get_vectors(fn):
        yield line
        i += 1
        if i >= n:
            break
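To make the plain-text format concrete, here is a Python 3 sketch of the same parsing logic as the '.txt' branch above, run on two in-memory lines. The function name `parse_glove_lines` and the sample values are made up for illustration and are not real GloVe numbers:

```python
import io

def parse_glove_lines(lines):
    """Yield (word, vector) pairs from lines shaped like 'word f1 f2 ... fn'."""
    for line in lines:
        items = line.strip().split()
        if not items:
            continue  # skip blank lines
        yield items[0], [float(x) for x in items[1:]]

# Two made-up 3-dimensional entries standing in for an embeddings file
sample = io.StringIO("the 0.1 0.2 -0.3\ncat 0.4 -0.5 0.6\n")
pairs = list(parse_glove_lines(sample))

# The vector length of the first entry is what sizes the Annoy index.
size = len(pairs[0][1])
print(pairs[0][0], size)
```

Each line holds one item followed by its floats, so the embedding length can be read off the first entry, which is how create_index finds it.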

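Testing the Annoy index and lmdb map

We have generated the Annoy index and the lmdb map. Now let's write a script that uses them for inference.

Name the file annoy_inference.py and start with the following dependencies: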

'''
Usage: python2 annoy_inference.py \
    --token='hello' \
    --num_results=<int> \
    --verbose

Query an Annoy index to find approximate nearest neighbors
'''
import annoy
import lmdb
import argparse

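Now we need to load the Annoy index and the lmdb map. We load them globally so they are easy to access. Note that VEC_LENGTH is set to 50 here. Make sure your VEC_LENGTH matches the length of your embeddings, or Annoy will not be happy.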

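VEC_LENGTH = 50  # must match the length of the embeddings used to build the index
FN_ANNOY = 'glove.6B.50d.txt.annoy'
FN_LMDB = 'glove.6B.50d.txt.lmdb'

# Load the saved Annoy index and open the lmdb map once, at module level
a = annoy.AnnoyIndex(VEC_LENGTH)
a.load(FN_ANNOY)
env = lmdb.open(FN_LMDB, map_size=int(1e9))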

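The interesting part is the "calculate" function.

1. Get the index (id) of the query from the lmdb map;

2. Get the corresponding vector from Annoy with get_item_vector(id);

3. Get the nearest neighbors from Annoy with a.get_nns_by_vector(v, num_results).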
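The three steps can be sketched without Annoy or lmdb, using in-memory stand-ins. Everything here is an assumption for illustration: the tiny vocabulary, the made-up 2-d vectors, and an exact cosine-distance search (`nns_by_vector`) in place of the approximate query:

```python
import math

# Made-up vocabulary and 2-d vectors standing in for the index contents
words = ["test", "tests", "exam", "banana"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
word_to_id = {w: i for i, w in enumerate(words)}  # stand-in for the lmdb 'w' keys

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def nns_by_vector(v, n):
    """Exact stand-in for an approximate query: ids of the n closest vectors."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(v, vectors[i]))
    return ranked[:n]

def lookup(query, num_results):
    item_id = word_to_id.get(query)      # step 1: word -> id
    if item_id is None:
        return []                        # the word is not in the vocabulary
    v = vectors[item_id]                 # step 2: id -> vector
    return [words[i] for i in nns_by_vector(v, num_results)]  # step 3: neighbors

neighbours = lookup("test", 3)
print(neighbours)
```

As in the real script, the query word comes back as its own nearest neighbor, because its vector is at distance zero from itself.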

'''
private function calculate(query, num_results)
Queries a given Annoy index and lmdb map for num_results nearest neighbors
Input:
    query - query to be searched
    num_results - the number of results
Return:
    ret_keys - list of num_results nearest neighbors keys
'''
def calculate(query, num_results, verbose=False):
    ret_keys = []
    with env.begin() as txn:
        # Word keys carry a 'w' prefix and id values an 'i' prefix
        entry = txn.get('w' + query)
        if entry is None:
            # The query word is not in the lmdb map
            return ret_keys
        id = int(entry[1:])
        if verbose:
            print("Query: {}, with id: {}".format(query, id))
        v = a.get_item_vector(id)
        for id in a.get_nns_by_vector(v, num_results):
            key = txn.get('i%d' % id)[1:]
            ret_keys.append(key)
    if verbose:
        print("Found: {} results".format(len(ret_keys)))
    return ret_keys

Once again, argparse is used here to make reading command-line arguments simpler.

'''
private function _create_args()
Creates an argparse object for CLI for calculate() function
Input:
    Void
Return:
    args object with required arguments for calculate() function
'''
def _create_args():
    parser = argparse.ArgumentParser()
    # Both values are needed by calculate(), so fail early if they are missing
    parser.add_argument("--token", help="query word", type=str, required=True)
    parser.add_argument("--num_results", help="number of results to return", type=int, required=True)
    parser.add_argument("--verbose", help="print logging", action="store_true")
    args = parser.parse_args()
    return args

The main function lets us invoke annoy_inference.py from the command line.

if __name__ == '__main__':
    args = _create_args()
    print(calculate(args.token, args.num_results, args.verbose))

Now we can use the Annoy index and lmdb map to get the nearest neighbors of a query!

python2 annoy_inference.py --token="test" --num_results=30

['test', 'tests', 'determine', 'for', 'crucial', 'only', 'preparation', 'needed', 'positive', 'guided', 'time', 'performance', 'one', 'fitness', 'replacement', 'stages', 'made', 'both', 'accuracy', 'deliver', 'put', 'standardized', 'best', 'discovery', '.', 'a', 'diagnostic', 'delayed', 'while', 'side']
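Approximate results like these can be compared against an exact search to see how much accuracy was traded for speed. One common way is to measure the overlap between the two result lists. A sketch, in which the function name `recall_at_k` and both word lists are made up for illustration and are not output from the index above:

```python
def recall_at_k(approx, exact):
    """Fraction of the exact top-k results that the approximate search also found."""
    if not exact:
        return 0.0
    return len(set(approx) & set(exact)) / len(exact)

# Hypothetical result lists, not taken from a real run
exact = ["test", "tests", "exam", "testing"]
approx = ["test", "tests", "determine", "exam"]

score = recall_at_k(approx, exact)
print(score)
```

If the overlap is too low for your use case, the usual lever on the indexing side is the number of trees passed to the build step, at the cost of a larger index.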

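Code

All code for this tutorial is on GitHub: https://github.com/kyang6/annoy_tutorial

Original article: https://medium.com/@kevin_yang/simple-approximate-nearest-neighbors-in-python-with-annoy-and-lmdb-e8a701baf905

This article was compiled by 機器之心. Please contact the account for permission to republish.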
