Boblee graduated with a master's degree in artificial intelligence, is skilled in and fond of Python, researches artificial intelligence, swarm intelligence, blockchain and related technologies with Python, and also uses Python for front-end/back-end development and web crawlers.
This article is adapted from https://zhuanlan.zhihu.com/p/28979653.
1. A brief introduction to word2vec
word2vec is a toolkit for learning word vectors that Google open-sourced in 2013. It is simple and efficient, which is why it has attracted so much attention. Readers interested in the mathematics behind word2vec can consult 《word2vec 中的數學原理詳解》; the theory is not covered in detail here. word2vec trains word vectors in two ways: the CBOW model, which predicts the center word from its context, and the skip-gram model, which predicts the context from the center word. CBOW is better suited to small datasets, while skip-gram performs better on large training corpora.
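As a rough illustration of the difference (a sketch added here, not from the original post; the toy sentence and window size are arbitrary), the snippet below builds the kind of (input, label) pairs the two modes are trained on:

```python
# Illustration only: toy sentence, window of 1; real word2vec training also
# involves subsampling, negative sampling, etc.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1

skip_gram_pairs = []   # skip-gram: the center word predicts each context word
cbow_pairs = []        # CBOW: the context words jointly predict the center word
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    for c in context:
        skip_gram_pairs.append((center, c))
    cbow_pairs.append((context, center))

print(skip_gram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[1])        # (['the', 'brown'], 'quick')
```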
2. Training on synonyms
This article trains on the synonym thesaurus (Tongyici Cilin) provided by Harbin Institute of Technology; the dataset is available at https://github.com/TernenceWind/replaceSynbycilin
Training proceeds in the following steps:
1. Extract the synonyms and collect every word into a single list, from which the word statistics, the dictionary (word → id), and the reverse dictionary (id → word) are built. Since a computer cannot work with Chinese text directly, each word must be replaced by a numeric id.
2. Build the training set: at each step a row of synonyms is picked at random, for example the row 「人類、主人、全人類」, and pairs are formed such as feeding in 「人類」 to predict 「主人」, or feeding in 「主人」 to predict 「全人類」 (a small sketch of this mapping follows this list).
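To make steps 1 and 2 concrete, here is a minimal, self-contained sketch (for illustration only; the variable names here are not the ones used in the implementation below) of how one synonym row can be turned into numeric training pairs:

```python
import random

# One synonym row from the thesaurus (the example row from step 2).
row = ['人類', '主人', '全人類']

# Step 1: give every distinct word an integer id, plus the inverse mapping.
word_to_id = {w: i for i, w in enumerate(row)}        # {'人類': 0, '主人': 1, '全人類': 2}
id_to_word = {i: w for w, i in word_to_id.items()}

# Step 2: one training pair is two distinct words drawn from the same row;
# the first word is the input, the second is the label to predict.
a, b = random.sample(row, 2)
print((word_to_id[a], word_to_id[b]), (a, b))         # e.g. (0, 1) ('人類', '主人')
```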
3. Python implementation
Initializing the data
```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from random import choice
import collections
import math
import random

import numpy as np
from six.moves import xrange
import tensorflow as tf

# Read the thesaurus: each line is one group of synonyms; the first token
# (the category code) is dropped.
data = []
all_word = []
for line in open('cilin.txt', 'r', encoding='utf8'):
    line = line.replace('\n', '').split(' ')[1:]
    data.append(line)
    for element in line:
        if element not in all_word:
            all_word.append(element)

# Map every word to an integer id and keep the inverse mapping.
dictionary = [i for i in range(len(all_word))]
reverse_dictionary_ = dict(zip(dictionary, all_word))   # id   -> word
reverse_dictionary = dict(zip(all_word, dictionary))    # word -> id
```

Building the training set
```python
batch_size = 128
embedding_size = 128
skip_window = 1           # unused in this synonym setup (kept from the original tutorial)
num_skips = 2             # unused in this synonym setup (kept from the original tutorial)
valid_size = 4
valid_window = 100
num_sampled = 64          # number of negative samples for the NCE loss
vocabulary_size = len(all_word)

# Words used to eyeball training progress.
valid_word = ['專家', '住戶', '祖父', '家鄉']
valid_examples = [reverse_dictionary[li] for li in valid_word]

def generate_batch(data, batch_size):
    """Draw batch_size (input, label) pairs, each pair being two words
    sampled from the same synonym row (assumes every row has at least two words)."""
    data_input = np.ndarray(shape=(batch_size,), dtype=np.int32)
    data_label = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    for i in range(batch_size):
        pair = random.sample(choice(data), 2)   # two distinct words from one row
        data_input[i] = reverse_dictionary[pair[0]]
        data_label[i, 0] = reverse_dictionary[pair[1]]
    return data_input, data_label
```
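As a quick sanity check (not part of the original code), generate_batch can be called with a small batch to inspect what it returns; the batch size of 8 here is arbitrary:

```python
# Illustration only: draw a tiny batch and inspect it.
demo_inputs, demo_labels = generate_batch(data, 8)
print(demo_inputs.shape, demo_labels.shape)            # (8,) (8, 1)
# Each input id and its label id come from the same synonym row.
print([(reverse_dictionary_[int(i)], reverse_dictionary_[int(j)])
       for i, j in zip(demo_inputs, demo_labels[:, 0])])
```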
Building the network

```python
graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    with tf.device('/cpu:0'):
        # Embedding matrix: one embedding_size-dimensional vector per word.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Weights and biases for the NCE (noise-contrastive estimation) loss.
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]), dtype=tf.float32)

    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       inputs=embed,
                       labels=train_labels,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

    # Cosine similarity between the validation words and the whole vocabulary.
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

    init = tf.global_variables_initializer()
```
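The graph above uses TensorFlow 1.x APIs (placeholder, Session, GradientDescentOptimizer). If only TensorFlow 2.x is available, one possible workaround, offered here as a suggestion rather than something from the original post, is to run the code through the v1 compatibility layer:

```python
# Assumption: only TensorFlow 2.x is installed. The v1 compatibility shim lets
# the graph/Session code in this post run largely unchanged (deprecated
# arguments such as keep_dims may emit warnings or need renaming to keepdims).
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
```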
Starting the training

```python
num_steps = 2000000

with tf.Session(graph=graph) as session:
    init.run()
    print("Initialized")

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(data, batch_size)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        # Report the average loss every 2000 steps.
        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        # Every 10000 steps print the nearest words to each validation word.
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                print(valid_examples[i])
                valid_word = reverse_dictionary_[valid_examples[i]]
                top_k = 8
                nearest = (-sim[i, :]).argsort()[:top_k]
                log_str = "Nearest to %s:" % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary_[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)

    final_embeddings = normalized_embeddings.eval()
    np.save('tongyi.npy', final_embeddings)
```
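After training, the saved matrix can be reused without TensorFlow. The following is a minimal sketch (an addition for illustration; it assumes the reverse_dictionary / reverse_dictionary_ mappings built earlier are still in memory and that the query word is in the vocabulary) of how to look up the closest words to a query:

```python
import numpy as np

# tongyi.npy stores the row-normalised embeddings, so a dot product with a
# normalised row is a cosine similarity.
emb = np.load('tongyi.npy')

def nearest_words(query, top_k=8):
    vec = emb[reverse_dictionary[query]]        # word -> id -> embedding vector
    sims = emb @ vec                            # similarity to every vocabulary word
    best = np.argsort(-sims)[1:top_k + 1]       # skip position 0: the query itself
    return [reverse_dictionary_[int(i)] for i in best]

print(nearest_words('專家'))                    # example query word
```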
Visualization

```python
def plot_with_labels(low_dim_embs, labels, filename='images/tsne3.png', fonts=None):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(15, 15))
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     fontproperties=fonts,   # a CJK font, otherwise labels render as boxes
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.savefig(filename, dpi=800)
```
```python
try:
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt
    from matplotlib.font_manager import FontProperties

    # A font that contains CJK glyphs so the labels display correctly.
    font = FontProperties(fname=r"simhei.ttf", size=14)

    # Project the first 100 embeddings to 2-D with t-SNE and plot them.
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 100
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary_[i] for i in xrange(plot_only)]
    plot_with_labels(low_dim_embs, labels, fonts=font)
except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")
```

4. Experimental results
Training results
As the resulting figure shows, the synonym relationships have essentially been learned.
5. Conclusion
Building on this earlier work, this article implements synonym word-vector generation; the resulting vectors can now be used in downstream NLP text-processing tasks.