fastText的源碼,看這一篇就夠了!

2021-03-01 貝殼智搜

面對一個新的模型，我們可能會想這個模型誰提出的，可以做什麼。然後考慮是否有需要，開始用之前，會問這個模型支持哪些參數，數據格式是什麼，至此，可以試試效果。試完效果發現咦還不錯，後面可能會橫向、縱向比較。使用一段時候開始研究源碼，除可以弄清楚原理外，還可以學習很多編程技巧，知其所以然。因此，本解析將會圍繞這些展開討論。

1 fasteText能幹什麼1.1 fastText是什麼

fastText是Facebook AI Reserch在16年開源的一個詞向量及文本分類工具，性能比肩深度學習而且速度更快，能夠訓練模型「在使用標準多核CPU的情況下10分鐘內處理超過10億個詞彙」，特別是與深度模型對比，fastText能將訓練時間由數天縮短到幾秒鐘。並且，它在模型架構上跟word2vec非常相似，兩者作者都是Tomas Mikolov。

1.2 fastText支持哪些參數

現在已經知道fastText能夠幹什麼，而且詞向量已經有成名的word2vec，所以讓我們聚焦在fastText的另外一個功能分類。在使用模型前，比較重要的就是模型提供了什麼參數，具體參考fastText源碼[1]。

fastText 提供的幾個功能

參數說明supervised監督學習，訓練分類器test評價一個分類器predict預測標籤predict-prob預測標籤，並輸出預測概率skipgram訓練skipgram模型cbow訓練cbow模型print-vectors輸入一個訓練好的模型，列印向量

fastText 分類提供的參數

參數說明默認值-inputtraining file path
-outputoutput file path
-verboseverbosity level2-minCountminimal number of word occurrences[1]-minCountLabelminimal number of label occurrences[0]-wordNgramsword ngram的最大長度[1]-bucket提供word 和char ngram最大值[2000000]-minnchar ngram的最小長度[0]-maxnchar ngram 的最大長度[0]-tsampling threshold[0.0001]-labellabels prefix[__label__]-lrlearning rate[0.1]-lrUpdateRatechange the rate of updates for the learning rate[100]-dimsize of word vectors[100]-wssize of the context window[5]-epochnumber of epochs[5]-negnumber of negatives sampled[5]-lossloss function {ns, hs, softmax}[softmax]-threadnumber of threads[12]-pretrainedVectors對於分類問題可以提供與訓練的詞向量[]-saveOutputwhether output params should be saved[0]1.3 fastText 輸入輸出

輸入:一行為一條訓練數據,每個詞可以用空格等常用分隔符分割,對於分類問題來說,在類別前面加"__label__"表示其為該行訓練數據的標籤,當然可以自定義了,可以使用-label自定義,訓練文件支持 UTF-8 格式,字符會按照UTF-8編碼分割。

輸出:輸出支持topK,帶概率.

1.4 fastText 為什麼如此快2 fastText與word2vec的異同

如果有對word2vec原理不清楚的可以看這篇博客word2vec[2]，詳細介紹了word2vec的數學原理。

2.1 相同點

了解word2vec的同學，可以發現fasttext的模型結構和思想基本一致，倒是省去了理解的步驟。

2.2 不同點3 源碼解析

fasttext整體的結構圖如下：$$x_i$$為根據dictionary找到每個詞對應的id，根據id取對應的向量。然後，求和平均得到中間隱變量h。最後，經過Negative Sampling、Softmax或者Hierarchical Softmax進行分類。

3.1 源碼結構

源碼提供了以下幾個模塊，核心model模塊提供各種算法：

3.2 頭文件解析類說明args.h提供主要參數解析dictionary.h主要模塊，提供計數、映射、ngramfasttext.h集合各個類matrix.h詞向量矩陣model.h核心模塊，各種算法real.h數據類型utils.h文件相關的工具類vector.h向量抽象類3.3 主要類fasttext.h

fasttext 是最頂層的模塊，它的主要功能是訓練和預測，首先是訓練功能的調用路徑，第一個函數是 train，它的主要作用是初始化參數，啟動多線程訓練。

// 訓練

void FastText::train(const Args args) {

args_ = std::make_shared<Args>(args);

dict_ = std::make_shared<Dictionary>(args_);

if (args_->input == "-") {

// manage expectations

throw std::invalid_argument("Cannot use stdin for training!");

}

std::ifstream ifs(args_->input);

if (!ifs.is_open()) {

throw std::invalid_argument(

args_->input + " cannot be opened for training!");

}

// 字典的構建

dict_->readFromFile(ifs);

ifs.close();

// 加載預訓練的詞向量

if (args_->pretrainedVectors.size() != 0) {

loadVectors(args_->pretrainedVectors);

} else {

// 否則隨機生成詞典，詞典的大小等於詞的個數加上ngram個個數或者說是參數bucket的大小

input_ = std::make_shared<Matrix>(dict_->nwords()+args_->bucket, args_->dim);

input_->uniform(1.0 / args_->dim);

}

// 區分是訓練詞向量還是分了，本質上一樣，只是一個目標是詞一個是類別

// 因此輸出矩陣大大小不一樣

if (args_->model == model_name::sup) {

output_ = std::make_shared<Matrix>(dict_->nlabels(), args_->dim);

} else {

output_ = std::make_shared<Matrix>(dict_->nwords(), args_->dim);

}

output_->zero();

startThreads();

model_ = std::make_shared<Model>(input_, output_, args_, 0);

if (args_->model == model_name::sup) {

model_->setTargetCounts(dict_->getCounts(entry_type::label));

} else {

model_->setTargetCounts(dict_->getCounts(entry_type::word));

}

具體多線程訓練部分，按照線程數把文件分成多個部分，每個部分獨立更新參數，線程之間並沒有加鎖。因此，會帶來一點噪音，但是對最後的結果沒有多大影響。word2vec 實現也沒有加鎖。

void FastText::trainThread(int32_t threadId) {

std::ifstream ifs(args_->input);

// 根據線程id選擇要訓練的數據，按照字節數分割，不是按照行分割

utils::seek(ifs, threadId * utils::size(ifs) / args_->thread);

// 每個線程都會新建一個model，但是model參數是共享的

Model model(input_, output_, args_, threadId);

// 區分是詞向量還是分類

if (args_->model == model_name::sup) {

model.setTargetCounts(dict_->getCounts(entry_type::label));

} else {

model.setTargetCounts(dict_->getCounts(entry_type::word));

}

const int64_t ntokens = dict_->ntokens();

int64_t localTokenCount = 0;

std::vector<int32_t> line, labels;

while (tokenCount_ < args_->epoch * ntokens) {

real progress = real(tokenCount_) / (args_->epoch * ntokens);

real lr = args_->lr * (1.0 - progress);

// 不同的模型選擇不同的訓練方式

if (args_->model == model_name::sup) {

localTokenCount += dict_->getLine(ifs, line, labels);

supervised(model, lr, line, labels);

} else if (args_->model == model_name::cbow) {

localTokenCount += dict_->getLine(ifs, line, model.rng);

cbow(model, lr, line);

} else if (args_->model == model_name::sg) {

localTokenCount += dict_->getLine(ifs, line, model.rng);

skipgram(model, lr, line);

}

// 迭代過程中，根據迭代次數，降低學習率

if (localTokenCount > args_->lrUpdateRate) {

tokenCount_ += localTokenCount;

localTokenCount = 0;

if (threadId == 0 && args_->verbose > 1)

loss_ = model.getLoss();

}

if (threadId == 0)

loss_ = model.getLoss();

ifs.close();

}

你沒有看錯，fasttext提供的三個功能，它們的代碼很相似，最後的更新後抽象成model.update(line, labels[i], lr)。supervised的輸入就是詞對應的id、標籤、學習率等。model的update見3.5 model分析。

void FastText::supervised(

Model& model,

real lr,

const std::vector<int32_t>& line,

const std::vector<int32_t>& labels) {

if (labels.size() == 0 || line.size() == 0) return;

// 支持一條樣本多個標籤，隨機選擇一個標籤

std::uniform_int_distribution<> uniform(0, labels.size() - 1);

int32_t i = uniform(model.rng);

// 更新參數

model.update(line, labels[i], lr);

}

void FastText::cbow(Model& model, real lr,

const std::vector<int32_t>& line) {

std::vector<int32_t> bow;

std::uniform_int_distribution<> uniform(1, args_->ws);

for (int32_t w = 0; w < line.size(); w++) {

int32_t boundary = uniform(model.rng);

bow.clear();

// 選擇上下文，預測中間詞

for (int32_t c = -boundary; c <= boundary; c++) {

if (c != 0 && w + c >= 0 && w + c < line.size()) {

const std::vector<int32_t>& ngrams =

dict_->getSubwords(line[w + c]);

bow.insert(bow.end(), ngrams.cbegin(), ngrams.cend());

}

model.update(bow, line[w], lr);

}

void FastText::skipgram(Model& model, real lr,

const std::vector<int32_t>& line) {

std::uniform_int_distribution<> uniform(1, args_->ws);

for (int32_t w = 0; w < line.size(); w++) {

int32_t boundary = uniform(model.rng);

const std::vector<int32_t>& ngrams = dict_->getSubwords(line[w]);

// 根據中間詞，預測上下文

for (int32_t c = -boundary; c <= boundary; c++) {

if (c != 0 && w + c >= 0 && w + c < line.size()) {

model.update(ngrams, line[w + c], lr);

}

當然fasttext.h還提供了預測的代碼，相對訓練來說就很簡單了，主要就是調用了model的predict函數，會在model中介紹。

3.4 主要類 dictionary.h

dictionary類中詞的保存格式如下，還保存了類型和char gram。

struct entry {

std::string word; // 詞

int64_t count; // 出現的次數

entry_type type; // 類型，是詞還是類別標籤

std::vector<int32_t> subwords; //char gram

};

整個dictionary類的主要函數功能說明

int32_t Dictionary::find(const std::string& w, uint32_t h) const {

int32_t word2intsize = word2int_.size();

int32_t id = h % word2intsize;

// 根據word 和 hash值查找其id

while (word2int_[id] != -1 && words_[word2int_[id]].word != w) {

id = (id + 1) % word2intsize;

}

return id;

}

void Dictionary::add(const std::string& w) {

int32_t h = find(w);

ntokens_++;

// 將詞添加到words_中

if (word2int_[h] == -1) {

entry e;

e.word = w;

e.count = 1;

e.type = getType(w);

words_.push_back(e);

word2int_[h] = size_++;

} else {

words_[word2int_[h]].count++;

}

void Dictionary::getSubwords(const std::string& word,

std::vector<int32_t>& ngrams,

std::vector<std::string>& substrings) const {

int32_t i = getId(word);

ngrams.clear();

substrings.clear();

if (i >= 0) {

ngrams.push_back(i);

substrings.push_back(words_[i].word);

}

if (word != EOS) {

computeSubwords(BOW + word + EOW, ngrams, &substrings);

}

uint32_t Dictionary::hash(const std::string& str) const {

uint32_t h = 2166136261;

for (size_t i = 0; i < str.size(); i++) {

h = h ^ uint32_t(int8_t(str[i]));

h = h * 16777619;

}

return h;

}

void Dictionary::computeSubwords(const std::string& word,

std::vector<int32_t>& ngrams,

std::vector<std::string>* substrings) const {

for (size_t i = 0; i < word.size(); i++) {

std::string ngram;

if ((word[i] & 0xC0) == 0x80) continue;

for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {

ngram.push_back(word[j++]);

while (j < word.size() && (word[j] & 0xC0) == 0x80) {

ngram.push_back(word[j++]);

}

if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {

int32_t h = hash(ngram) % args_->bucket;

pushHash(ngrams, h);

if (substrings) {

substrings->push_back(ngram);

}

void Dictionary::threshold(int64_t t, int64_t tl) {

sort(words_.begin(), words_.end(), [](const entry& e1, const entry& e2) {

if (e1.type != e2.type) return e1.type < e2.type;

return e1.count > e2.count;

});

words_.erase(remove_if(words_.begin(), words_.end(), [&](const entry& e) {

return (e.type == entry_type::word && e.count < t) ||

(e.type == entry_type::label && e.count < tl);

}), words_.end());

words_.shrink_to_fit();

size_ = 0;

nwords_ = 0;

nlabels_ = 0;

std::fill(word2int_.begin(), word2int_.end(), -1);

for (auto it = words_.begin(); it != words_.end(); ++it) {

int32_t h = find(it->word);

word2int_[h] = size_++;

if (it->type == entry_type::word) nwords_++;

if (it->type == entry_type::label) nlabels_++;

}

void Dictionary::addWordNgrams(std::vector<int32_t>& line,

const std::vector<int32_t>& hashes,

int32_t n) const {

for (int32_t i = 0; i < hashes.size(); i++) {

uint64_t h = hashes[i];

for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {

h = h * 116049371 + hashes[j];

pushHash(line, h % args_->bucket);

}

void Dictionary::addSubwords(std::vector<int32_t>& line,

const std::string& token,

int32_t wid) const {

if (wid < 0) { // out of vocab

if (token != EOS) {

computeSubwords(BOW + token + EOW, line);

}

} else {

if (args_->maxn <= 0) { // in vocab w/o subwords

line.push_back(wid);

} else { // in vocab w/ subwords

const std::vector<int32_t>& ngrams = getSubwords(wid);

line.insert(line.end(), ngrams.cbegin(), ngrams.cend());

}

3.5 核心類 model.h

盜圖一張，很明白清楚的指出了model類做了些什麼事情，參數和源碼參數保持一致。

所有核心算法都集中到了model類，裡面提供了各種實現，包括LR，Huffman、BP等，還是很有意思的。同時這個模塊把negativeSampling、hierarchicalSoftmax都抽象成LR。

首先，讓我們看下binaryLogistic這個函數，損失函數為常見的交叉熵損失函數。如果是正類，則為f(x)=-log(score)，如果為負類，則為：f(x)=-log(1-score) 現在看下其導數是什麼，為了保持和代碼一致，都使用源碼的符號，推導如下：我們令x=wo->dotRow(hidden, target)，我們都知道sigmoid函數的偏導為socre*(1-socre)，對於正類而言：

我們的目標是使得損失函數最小，所以是梯度下降，再結合學習率，因此hidden_的梯度更新如下：

real Model::binaryLogistic(int32_t target, bool label, real lr) {

real score = sigmoid(wo_->dotRow(hidden_, target));

real alpha = lr * (real(label) - score);

// 參考上文公式推導

grad_.addRow(*wo_, target, alpha);

wo_->addRow(hidden_, target, alpha);

if (label) {

return -log(score);

} else {

return -log(1.0 - score);

}

具體negativeSampling、hierarchicalSoftmax、Softmax的實現

real Model::negativeSampling(int32_t target, real lr) {

real loss = 0.0;

grad_.zero();

for (int32_t n = 0; n <= args_->neg; n++) {

// 負採樣，一個正樣本，n個負樣本

if (n == 0) {

loss += binaryLogistic(target, true, lr);

} else {

loss += binaryLogistic(getNegative(target), false, lr);

}

return loss;

}

real Model::hierarchicalSoftmax(int32_t target, real lr) {

real loss = 0.0;

grad_.zero();

const std::vector<bool>& binaryCode = codes[target];

const std::vector<int32_t>& pathToRoot = paths[target];

// 根據目標點，找到其路徑，和對應的二分類的標籤

for (int32_t i = 0; i < pathToRoot.size(); i++) {

loss += binaryLogistic(pathToRoot[i], binaryCode[i], lr);

}

return loss;

}

real Model::softmax(int32_t target, real lr) {

grad_.zero();

// 計算和所有的label之間的值，如果為目標值，則對應為1，否則為0

computeOutputSoftmax();

for (int32_t i = 0; i < osz_; i++) {

real label = (i == target) ? 1.0 : 0.0;

real alpha = lr * (label - output_[i]);

grad_.addRow(*wo_, i, alpha);

wo_->addRow(hidden_, i, alpha);

}

return -log(output_[target]);

}

看一下隱藏層的計算，需要注意的是：求和平均

void Model::computeHidden(const std::vector<int32_t>& input, Vector& hidden)

const {

assert(hidden.size() == hsz_);

hidden.zero();

for (auto it = input.cbegin(); it != input.cend(); ++it) {

if (quant_) {

hidden.addRow(*qwi_, *it);

} else {

hidden.addRow(*wi_, *it);

}

// 求和平均

hidden.mul(1.0 / input.size());

}

預測的代碼，只有當用於分類時，其才有用，基本操作，用堆保留最大的幾個值。如果為最後一層為hierarchicalSoftmax，這需要遍歷整個Huffman樹，找其topK，對應源碼dfs函數，實現了深度優先搜索。如果為Softmax，直接計算即可，找其topK，對應源碼findKBest。

void Model::predict(

const std::vector<int32_t>& input,

int32_t k,

real threshold,

std::vector<std::pair<real, int32_t>>& heap,

Vector& hidden,

Vector& output) const {

// 判斷k值是否符合要求

if (k == Model::kUnlimitedPredictions) {

k = osz_;

} else if (k <= 0) {

throw std::invalid_argument("k needs to be 1 or higher!");

}

// 判斷當前模型是否是分類模型

if (args_->model != model_name::sup) {

throw std::invalid_argument("Model needs to be supervised for prediction!");

}

heap.reserve(k + 1);

computeHidden(input, hidden);

// 根據最後一層不同，分別選擇不同的遍歷方式，找topk

if (args_->loss == loss_name::hs) {

dfs(k, threshold, 2 * osz_ - 2, 0.0, heap, hidden);

} else {

findKBest(k, threshold, heap, hidden, output);

}

std::sort_heap(heap.begin(), heap.end(), comparePairs);

}

void Model::findKBest(

int32_t k,

real threshold,

std::vector<std::pair<real, int32_t>>& heap,

Vector& hidden,

Vector& output) const {

computeOutputSoftmax(hidden, output);

for (int32_t i = 0; i < osz_; i++) {

if (output[i] < threshold) {

continue;

}

if (heap.size() == k && std_log(output[i]) < heap.front().first) {

continue;

}

// 把元素放到末尾

heap.push_back(std::make_pair(std_log(output[i]), i));

// 調整最小堆

std::push_heap(heap.begin(), heap.end(), comparePairs);

if (heap.size() > k) {

// 堆頂元素pop出堆中

std::pop_heap(heap.begin(), heap.end(), comparePairs);

// 刪除最後一個元素

heap.pop_back();

}

void Model::dfs(

int32_t k,

real threshold,

int32_t node,

real score,

std::vector<std::pair<real, int32_t>>& heap,

Vector& hidden) const {

// 由於取的是對數，所以隨著深度的加深，socre是越來越小，

// 如果中間已經小於閾值，其後續節點都不需要計算

if (score < std_log(threshold)) {

return;

}

// 同理

if (heap.size() == k && score < heap.front().first) {

return;

}

// 如果是葉子節點，計算該label的概率，加入到堆中

if (tree[node].left == -1 && tree[node].right == -1) {

heap.push_back(std::make_pair(score, node));

std::push_heap(heap.begin(), heap.end(), comparePairs);

if (heap.size() > k) {

std::pop_heap(heap.begin(), heap.end(), comparePairs);

heap.pop_back();

}

return;

}

real f;

if (quant_ && args_->qout) {

f = qwo_->dotRow(hidden, node - osz_);

} else {

f = wo_->dotRow(hidden, node - osz_);

}

f = 1. / (1 + std::exp(-f));

// 根據當前的概率，分別遍歷左節點和右節點，屬於先根遍歷

dfs(k, threshold, tree[node].left, score + std_log(1.0 - f), heap, hidden);

dfs(k, threshold, tree[node].right, score + std_log(f), heap, hidden);

}

參數的更新，需要注意一個地方，見注釋

void Model::update(const std::vector<int32_t>& input, int32_t target, real lr) {

assert(target >= 0);

assert(target < osz_);

if (input.size() == 0) {

return;

}

computeHidden(input, hidden_);

if (args_->loss == loss_name::ns) {

loss_ += negativeSampling(target, lr);

} else if (args_->loss == loss_name::hs) {

loss_ += hierarchicalSoftmax(target, lr);

} else {

loss_ += softmax(target, lr);

}

nexamples_ += 1;

// 由於是加權平均，所有最後的梯度也需要平均一下

if (args_->model == model_name::sup) {

grad_.mul(1.0 / input.size());

}

最後一個，Huffman樹的構建，比較有意思，原理參考下面圖示，我們知道Huffman構建的過程，選擇兩個最小的沒有使用過的數，構建一個非葉子節點。如果數據是倒序排序會出現什麼情況呢。

沒錯，正如你所見，只要從中間往兩邊遍歷，因為兩邊的數據都是有序的，選擇最小的兩個數據，構建中間節點，源碼就是這種思路，一看就懂，有木有。

void Model::buildTree(const std::vector<int64_t>& counts) {

tree.resize(2 * osz_ - 1);

for (int32_t i = 0; i < 2 * osz_ - 1; i++) {

tree[i].parent = -1;

tree[i].left = -1;

tree[i].right = -1;

tree[i].count = 1e15;

tree[i].binary = false;

}

for (int32_t i = 0; i < osz_; i++) {

tree[i].count = counts[i];

}

int32_t leaf = osz_ - 1;

int32_t node = osz_;

// 中間節點往兩邊選擇最小的點

for (int32_t i = osz_; i < 2 * osz_ - 1; i++) {

int32_t mini[2];

for (int32_t j = 0; j < 2; j++) {

if (leaf >= 0 && tree[leaf].count < tree[node].count) {

mini[j] = leaf--;

} else {

mini[j] = node++;

}

tree[i].left = mini[0];

tree[i].right = mini[1];

tree[i].count = tree[mini[0]].count + tree[mini[1]].count;

tree[mini[0]].parent = i;

tree[mini[1]].parent = i;

tree[mini[1]].binary = true;

}

for (int32_t i = 0; i < osz_; i++) {

std::vector<int32_t> path;

std::vector<bool> code;

int32_t j = i;

while (tree[j].parent != -1) {

path.push_back(tree[j].parent - osz_);

code.push_back(tree[j].binary);

j = tree[j].parent;

}

paths.push_back(path);

codes.push_back(code);

}

4 源碼修改4.1 帶權重fastText

結構圖：

反向傳播：

class Attention {

protected:

std::vector<real> weight_softmax_; // 權重softmax值

std::vector<real> weight_grad_; // 權重梯度更新

int64_t size_;

const int32_t dim_;

public:

Attention(int32_t dim): dim_(dim) {}

explicit Attention(int64_t n, int32_t dim):

weight_softmax_(n),

weight_grad_(n),

size_(n),

dim_(dim) {}

// Attention(const Softmax&) = default;

// Attention& operator=(const Attention&) = delete;

void zero() {

std::fill(weight_softmax_.begin(), weight_softmax_.end(), 0.0);

std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);

}

std::vector<real>& getWeightSoftmax() {

return weight_softmax_;

}

std::vector<real>& getWeightGrad() {

return weight_grad_;

}

void resize(int64_t n) {

size_ = n;

weight_softmax_.resize(n);

weight_grad_.resize(n);

}

* 權重softmax

void calWeightSoftmax(const Matrix& A, const std::vector<int32_t>& input) {

if (input.size() > weight_softmax_.size()) {

weight_softmax_.resize(input.size());

}

real sum = 0;

real max = 0;

if (input.size() > 0)

max = A.getWeight(input[0]);

for ( int64_t i = 0; i < input.size(); ++i ) {

if ( max < A.getWeight(input[i]))

max = A.getWeight(input[i]);

}

for ( int64_t i = 0; i < input.size(); ++i ) {

weight_softmax_[i] = std::exp(A.getWeight(input[i]) - max);

sum += weight_softmax_[i];

}

for ( int64_t i = 0; i < input.size(); i++ ) {

weight_softmax_[i] /= sum;

}

* 計算權重的更新

void calWeightGrad(const Matrix& A, const std::vector<int32_t>& input,

const Vector& grad, const Vector& hidden) {

if (input.size() > weight_grad_.size()) {

weight_grad_.resize(input.size());

}

// std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);

for (int32_t j = 0; j < input.size(); ++j) {

real sum = 0;

for (int i = 0; i < dim_; ++i) {

sum += grad[i]*(A.at(input[j], i) - hidden[i]);

}

weight_grad_[j] = weight_softmax_[j]*sum;

}

for (int32_t i = 0; i < input.size(); i++) {

weight_grad_[i] /= A.size(1);

}

void computeHidden(const Matrix& A, const std::vector<int32_t>& input,

Vector& hidden) {

// 計算權重softmax

calWeightSoftmax(A, input);

auto it = input.cbegin();

auto it_s = weight_softmax_.cbegin();

for (; it != input.cend() && it_s != weight_softmax_.cend();

++it, ++it_s) {

hidden.addRow(A, *it, *it_s);

}

// hidden.mul(1.0 / input.size());

}

};

4.2 self-attetnion fastText

class Attention {

protected:

// 權重參數

std::vector<real> weight_softmax_; // 權重softmax值

// std::vector<real> weight_softmax_grad_mat_; // hi 對 wj 的梯度

std::vector<real> weight_grad_; // 權重梯度更新

int64_t size_;

const int32_t dim_;

// x self-attention 參數

std::vector<real> x_softmax_; // x softmax 權重

std::vector<real> y_; // yi=sum(wij*xj) x的self-attention

std::vector<real> hi_xjk_grad_; // hi 對 xj,k 的梯度其中j是固定的

std::vector<real> xj_grad_; // xj的梯度

std::vector<real> max_x_softmax_;

public:

Attention(int32_t dim): dim_(dim) {}

explicit Attention(int64_t n, int32_t dim):

weight_softmax_(n),

weight_grad_(n),

size_(n),

dim_(dim),

x_softmax_(n*n),

y_(n*dim),

hi_xjk_grad_(dim*dim),

xj_grad_(dim){}

// Attention(const Softmax&) = default;

// Attention& operator=(const Attention&) = delete;

void zero() {

std::fill(weight_softmax_.begin(), weight_softmax_.end(), 0.0);

std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);

}

std::vector<real>& getWeightSoftmax() {

return weight_softmax_;

}

std::vector<real>& getWeightGrad() {

return weight_grad_;

}

void resize(int64_t n) {

if (n > size_) {

size_ = n;

weight_softmax_.resize(n);

weight_grad_.resize(n);

x_softmax_.resize(n*n);

y_.resize(n*dim_);

}

* 權重softmax 歸一化

void calWeightSoftmax(const Matrix& A,

const std::vector<int32_t>& input) {

if (input.size() > weight_softmax_.size()) {

weight_softmax_.resize(input.size());

}

real sum = 0;

real max = 0;

if (input.size() > 0)

max = A.getWeight(input[0]);

for ( int64_t i = 0; i < input.size(); ++i ) {

if ( max < A.getWeight(input[i]))

max = A.getWeight(input[i]);

}

for ( int64_t i = 0; i < input.size(); ++i ) {

weight_softmax_[i] = std::exp(A.getWeight(input[i]) - max);

sum += weight_softmax_[i];

}

for ( int64_t i = 0; i < input.size(); i++ ) {

weight_softmax_[i] /= sum;

}

* 計算權重的更新

void calWeightGrad(const Matrix& A,

const std::vector<int32_t>& input,

const Vector& hidden,

const Vector& grad) {

if (input.size() > weight_grad_.size()) {

weight_grad_.resize(input.size());

}

// std::fill(weight_grad_.begin(), weight_grad_.end(), 0.0);

for (int32_t j = 0; j < input.size(); ++j) {

real tmp = 0;

for (int i = 0; i < dim_; ++i) {

tmp += grad[i]*(y_[j*dim_ + i]-hidden[i]);

}

weight_grad_[j] = weight_softmax_[j]*tmp;

}

void computeHidden(const Matrix& A, const std::vector<int32_t>& input,

Vector& hidden) {

// 計算權重softmax

calWeightSoftmax(A, input);

// 計算y值

calY(A, input);

hidden.zero();

for(int32_t i = 0; i < dim_; ++i) {

for( int32_t j=0; j<input.size(); ++j) {

hidden[i] += weight_softmax_[j]*y_[j*dim_+i];

}

// hidden.mul(1.0 / input.size());

}

// 計算self-attention 的權重 x_softmax_[i,j] = e^(xi*xj)/sum(e^xi*xk)

void calXSoftmax(const Matrix& A, const std::vector<int32_t>& input) {

if (input.size()*input.size() > x_softmax_.size())

x_softmax_.resize(input.size()*input.size());

std::fill(x_softmax_.begin(), x_softmax_.end(), 0.0);

if ( input.size() > max_x_softmax_.size())

max_x_softmax_.resize(input.size());

for(int32_t i = 0; i < input.size(); ++i) {

for(int32_t j = 0; j < input.size(); ++j) {

for(int32_t d = 0; d < dim_; ++d){

x_softmax_[i*input.size() + j] += A.at(input[i], d)*A.at(input[j], d);

}

if ( j == 0)

max_x_softmax_[i] = x_softmax_[i*input.size() + j];

else if (max_x_softmax_[i] < x_softmax_[i*input.size() + j])

max_x_softmax_[i] = x_softmax_[i*input.size() + j];

}

for(int32_t i = 0; i < input.size(); ++i) {

real sum = 0;

for(int32_t j = 0; j < input.size(); ++j) {

x_softmax_[i*input.size() + j]=

std::exp(x_softmax_[i*input.size() + j]

-max_x_softmax_[i]);

sum += x_softmax_[i*input.size() + j];

}

for(int32_t j = 0; j < input.size(); ++j) {

x_softmax_[i*input.size() + j] /= sum;

}

// 計算Y值， y[i] = sum(w[i,j]*xj)

void calY(const Matrix& A, const std::vector<int32_t>& input) {

if ( input.size()*dim_ > y_.size())

y_.resize(input.size()*dim_);

calXSoftmax(A, input);

std::fill(y_.begin(), y_.end(), 0.0);

for(int32_t i = 0; i < input.size(); ++i){

for(int32_t d = 0; d < dim_; ++d){

for(int32_t j = 0; j < input.size(); ++j){

y_[i*dim_ + d] += x_softmax_[i*input.size() + j]*A.at(input[j], d);

}

// 反向傳播 hi_xjk_grad_ hi 對 xjk 的梯度其中j是固定的

void caHiXjkGrad(const Matrix& A,

const std::vector<int32_t>& input, int32_t j) {

if (dim_*dim_ > hi_xjk_grad_.size())

hi_xjk_grad_.resize(dim_*dim_);

std::fill(hi_xjk_grad_.begin(), hi_xjk_grad_.end(), 0.0);

for (int32_t i = 0; i < dim_; ++i) {

for (int32_t k = 0; k < dim_; ++k) {

for (int32_t m = 0; m < input.size(); ++m) {

if ( i == k) {

hi_xjk_grad_[i*dim_ + k] +=

weight_softmax_[m]*x_softmax_[m*input.size() + j];

}

real sum = 0;

sum += A.at(input[m], k)*x_softmax_[m*input.size()

+ j]*(A.at(input[j], i) - y_[m*dim_ + i]);

if (m == j) {

sum -= y_[m*dim_ + k]*y_[m*dim_ + i];

for (int32_t l = 0; l < input.size(); ++l) {

sum += A.at(input[l], k)*x_softmax_[m*input.size()

+ l]*A.at(input[l], i);

}

hi_xjk_grad_[i*dim_ + k] += weight_softmax_[m]*sum;

}

// 反向傳播計算損失函數對xj的梯度

void calXGrad(const Matrix& A, const std::vector<int32_t>& input,

const Vector& grad, Vector& xgrad, int32_t id) {

caHiXjkGrad(A, input, id);

xgrad.zero();

for (int32_t k = 0; k < dim_; ++k) {

for (int32_t i = 0; i < dim_; ++i) {

xgrad[k] += grad[i]*hi_xjk_grad_[i*dim_ + k];

}

};

5 總結

fastText是一個簡單高效的文本分類工具，近兩年應用廣泛，用簡單模型的訓練時間得到比肩深度學習的效果。面對這麼優秀的開源工具，本文為你詳細剖析了其C++源碼實現，並在源碼基礎上做了兩個新的嘗試：

帶權重fasttext

self-attention fastText

這兩個嘗試前者並不會顯著增加訓練時間，但是後者會。當然這只是我們一時技癢，加了attention的fastText，就像是給AK47綁上了重機槍的架子，失去了其本來意義。當然，我們希望本文的源碼分析能給同行們一些有用的啟示，歡迎大家繼續互相交流討論NLP技術。

若有意加入我們團隊一起搞事情，把你亮閃閃的簡歷砸向chenkaijiang001@ke.com 。

fastText的源碼,看這一篇就夠了!

相關焦點

文本分類經典論文:fasttext,textcnn解讀

【NLP實戰】手把手帶你fastText文本分類

FastText的內部機制

fastText原理及實踐

專欄 | fastText原理及實踐

fastText,智慧與美貌並重的文本分類及向量化工具

HashMap面試題,看這一篇就夠了!

中考英語詞性變換大全,只看這一篇就夠了!

關於兒童安全座椅，看這一篇就夠了

Python正則表達式,這一篇就夠了!

立式攪拌器安裝教程看這一篇就夠了

使用Gensim來實現Word2Vec和FastText

Handler原理,這一篇就夠了

源碼詳解直播公開課 | ORBSLAM2中Oriented Fast神奇高效的代碼實現方式

淺談Vue.js中v-text,v-html,{{}}之間的異同

NLP中的詞向量對比:word2vec/glove/fastText/elmo/GPT/bert

今年流行穿衣風格夏天穿衣搭配看這一篇就夠了

深入分析java集合類LinkedList(源碼分析)

Flutter 常用widget之Text

fastText完美解決你的需求(上篇)

fastText的源碼,看這一篇就夠了!

相關焦點

文本分類經典論文:fasttext,textcnn解讀

【NLP實戰】手把手帶你fastText文本分類

FastText的內部機制

fastText原理及實踐

專欄 | fastText原理及實踐

fastText,智慧與美貌並重的文本分類及向量化工具

HashMap面試題,看這一篇就夠了!

中考英語詞性變換大全,只看這一篇就夠了!

關於兒童安全座椅，看這一篇就夠了

Python正則表達式,這一篇就夠了!

立式攪拌器安裝教程看這一篇就夠了

使用Gensim來實現Word2Vec和FastText

Handler原理,這一篇就夠了

源碼詳解直播公開課 | ORBSLAM2中Oriented Fast神奇高效的代碼實現方式

淺談Vue.js中v-text,v-html,{{}}之間的異同

NLP中的詞向量對比:word2vec/glove/fastText/elmo/GPT/bert

今年流行穿衣風格 夏天穿衣搭配看這一篇就夠了

深入分析java集合類LinkedList(源碼分析)

Flutter 常用widget之Text

fastText完美解決你的需求(上篇)

今年流行穿衣風格夏天穿衣搭配看這一篇就夠了