Creating one model for every pair of languages is clearly impractical: the number of deep models needed would reach into the hundreds, each of which would need to be stored on a user's phone or PC for efficient and/or offline use. Instead, Google decided to create one large neural network that could translate between any two languages, given special tokens (input indicators) identifying the languages involved.
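Conceptually, steering a single shared model with language tokens might look something like the sketch below; the token names and the helper function are illustrative, not Google's actual preprocessing.

```python
# Illustrative sketch only: a single multilingual model can be steered by
# prepending a special token that names the desired target language.
# The "<2ko>"-style token names are invented for this example.
def add_language_token(tokens, target_lang):
    """Prepend a target-language indicator so one shared model
    knows which language to translate into."""
    return [f"<2{target_lang}>"] + tokens

print(add_language_token(["Hello", "world"], "ko"))
# ['<2ko>', 'Hello', 'world']
```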
The fundamental structure of the model is encoder-decoder. One segment of the neural network seeks to reduce a sentence in one language into a fundamental, machine-readable "universal representation", whereas the other takes this universal representation and progressively re-expresses the underlying ideas in the output language. This is a sequence-to-sequence architecture; the following graphic gives a good intuition of how it works, how previously generated content plays a role in generating subsequent outputs, and its sequential nature.
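To make the sequential nature concrete, here is a minimal sketch of an autoregressive decoding loop; `decode_step` is a dummy stand-in for the real decoder network, and the tokens it returns are invented for the example.

```python
# A minimal sketch of sequential (autoregressive) decoding: output tokens are
# produced one step at a time, and previously generated tokens feed back into
# the next step. `decode_step` is a stand-in for the real decoder network.
def decode_step(encoded_source, generated_so_far):
    """Dummy stand-in: a real decoder would score the whole vocabulary here."""
    vocab = ["안녕하세요", "세계", "<end>"]
    return vocab[min(len(generated_so_far), len(vocab) - 1)]

def translate(encoded_source, max_len=10):
    output = []
    while len(output) < max_len:
        next_token = decode_step(encoded_source, output)
        if next_token == "<end>":
            break
        output.append(next_token)
    return output

print(translate(encoded_source=None))  # ['안녕하세요', '세계']
```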
Consider an alternative visualization of this encoder-decoder relationship (a seq2seq model). The attention mechanism that sits between the encoder and decoder will be discussed later.
The encoder consists of eight stacked LSTM layers. In a nutshell, an LSTM (Long Short-Term Memory network) is an improvement upon the RNN (a neural network designed for sequential data) that allows the network to "remember" useful information and make better future predictions. To address the fact that meaning in language does not always flow strictly left to right, the first two layers add bidirectionality: pink nodes indicate a left-to-right reading, whereas green nodes indicate a right-to-left reading. This allows GNMT to accommodate different grammar structures.
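A minimal PyTorch sketch of this kind of encoder is shown below: a bidirectional LSTM at the bottom with unidirectional LSTM layers stacked on top. The layer sizes and vocabulary size are placeholders, not the values Google used.

```python
# A minimal sketch of a GNMT-style encoder: a bidirectional LSTM layer at the
# bottom, followed by a stack of unidirectional LSTM layers. Dimensions are
# illustrative only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=32000, emb_dim=256, hidden_dim=256, num_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bottom layer reads the sentence in both directions.
        self.bi_lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Remaining layers are unidirectional and stacked on top.
        self.stack = nn.LSTM(2 * hidden_dim, hidden_dim, num_layers=num_layers - 1,
                             batch_first=True)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        x, _ = self.bi_lstm(x)   # (batch, seq_len, 2 * hidden_dim)
        x, _ = self.stack(x)     # (batch, seq_len, hidden_dim)
        return x                 # one vector per source position

encoder = Encoder()
dummy_input = torch.randint(0, 32000, (1, 5))  # batch of one 5-token sentence
print(encoder(dummy_input).shape)              # torch.Size([1, 5, 256])
```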
The decoder is likewise composed of eight LSTM layers. These seek to translate the encoded content into the target language.
An "attention mechanism" is placed between the two models. In humans, attention helps us stay focused on a task by seeking out information relevant to that task rather than irrelevant details. In the GNMT model, the attention mechanism helps identify and amplify the importance of unfamiliar segments of the message, which are then prioritized during decoding. This solves a large part of the "rare words problem": words that appear less often in the dataset are compensated for with extra attention.
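The sketch below shows the core idea with simple dot-product attention; GNMT's actual attention function is a small feed-forward network, so this is only meant to illustrate how encoder states are weighted at each decoding step.

```python
# A minimal sketch of dot-product attention between a decoder state and the
# encoder outputs: score each source position, normalize, and take a weighted sum.
import torch

def attend(decoder_state, encoder_outputs):
    """decoder_state: (hidden_dim,); encoder_outputs: (seq_len, hidden_dim)."""
    scores = encoder_outputs @ decoder_state   # one score per source position
    weights = torch.softmax(scores, dim=0)     # normalize into a distribution
    context = weights @ encoder_outputs        # weighted sum of encoder states
    return context, weights

enc = torch.randn(5, 256)            # 5 source positions
dec = torch.randn(256)               # current decoder state
context, weights = attend(dec, enc)
print(weights)                       # how much the decoder "looks at" each source word
```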
Skip connections, or connections that jump over certain layers, were used to promote healthy gradient flow. As with the ResNet (Residual Network) model, gradient updates can get stuck at one particular layer, weakening the signal for all the layers before it. In such a deep network, comprising 16 LSTM layers in total, skip connections are imperative not only for training time but also for performance, as they allow gradients to bypass potentially problematic layers.
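Below is a minimal sketch of a residual (skip) connection wrapped around a single LSTM layer, assuming the layer's input and output dimensions match so the two can be summed; the dimensions are placeholders.

```python
# A sketch of a residual (skip) connection around one LSTM layer: the layer's
# input is added to its output, so gradients have a path around the LSTM.
import torch
import torch.nn as nn

class ResidualLSTMLayer(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return x + out  # skip connection

layer = ResidualLSTMLayer()
x = torch.randn(1, 5, 256)
print(layer(x).shape)  # torch.Size([1, 5, 256])
```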
The builders of GNMT invested considerable effort into developing an efficient low-level system that ran on the TPU (Tensor Processing Unit), a specialized machine-learning hardware processor designed by Google, to keep training as fast as possible.
An interesting benefit of using one model to learn all the translations was that translations could be learned indirectly. For instance, when GNMT was trained only on English-to-Korean, Korean-to-English, Japanese-to-English, and English-to-Japanese data, the model still yielded good Japanese-to-Korean and Korean-to-Japanese translations, even though it had never been directly trained on those pairs. This is known as zero-shot learning, and it significantly reduced the training time required for deployment.
Heavy pre-processing and post-processing are applied to the inputs and outputs of the GNMT model in order to support, for example, the highly specialized character sets often found in Asian languages. Inputs are tokenized according to a custom-designed system, with word segmentation and markers for the beginning, middle, and end of a word. These additions made the bridge between different fundamental representations of language more fluid.
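The toy sketch below illustrates the idea of breaking a word into smaller pieces with position markers; GNMT's real wordpiece vocabulary is learned from data, and both the piece inventory and the B/M/E marker convention here are invented for the example.

```python
# Illustrative subword segmentation with position markers (not GNMT's real
# wordpiece algorithm): greedily split a word using a toy inventory of known
# pieces, tagging each piece as beginning (B), middle (M), or end (E).
def segment(word, pieces):
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # longest match first
            if word[i:j] in pieces or j - i == 1:  # fall back to single characters
                parts.append(word[i:j])
                i = j
                break
    tags = ["B"] + ["M"] * max(len(parts) - 2, 0) + (["E"] if len(parts) > 1 else [])
    return list(zip(parts, tags))

print(segment("translation", {"trans", "lat", "ion"}))
# [('trans', 'B'), ('lat', 'M'), ('ion', 'E')]
```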
For training data, Google used documents and transcripts from the United Nations and the European Parliament. Since these organizations produce professionally translated material across many languages, with high quality (imagine the dangers of a badly translated declaration), this data was a good starting point. Later on, Google began using user ("community") input to strengthen culture-specific, slang, and informal language in its model.
GNMT was evaluated on a variety of metrics. During training, GNMT used log perplexity. Perplexity is closely related to entropy, particularly "Shannon entropy", so it may be easier to start from there. Entropy is the average number of bits needed to encode the information contained in a variable, and perplexity, in turn, measures how well a probability model can predict a sample. One intuitive example of perplexity is the number of characters a user must type into a search box before a query-suggestion system is at least 70% sure which query the user will type. It is a natural choice for evaluating NLP tasks and models.
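As a small, self-contained sketch of the quantity being tracked: given the probabilities a model assigned to each correct token, perplexity is the exponential of the average negative log-probability (the numbers below are made up).

```python
# Perplexity from per-token probabilities: exp of the average negative log-probability.
import math

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each correct token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that is fairly sure of every token scores a low perplexity...
print(perplexity([0.9, 0.8, 0.95]))  # ~1.13
# ...while a model that is often unsure scores a high one.
print(perplexity([0.2, 0.1, 0.3]))   # ~5.5
```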
The standard BLEU score for language translation attempts to measure how close a machine translation is to a human reference translation, on a scale from 0 to 1, using a string-matching algorithm. It is still widely used because it has shown strong correlation with human-rated performance: correct words are rewarded, with bonuses for consecutive correct words and for longer or more complex words.
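The toy function below shows the string-matching flavor of the metric via clipped n-gram precision; real BLEU combines 1- to 4-gram precisions with a geometric mean and a brevity penalty, which this sketch omits.

```python
# Toy clipped n-gram precision: how many of the candidate's n-grams also
# appear in the reference (counts clipped to the reference's counts).
from collections import Counter

def ngram_precision(candidate, reference, n):
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum(min(count, ref[gram]) for gram, count in Counter(cand).items())
    return matches / max(len(cand), 1)

cand = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(ngram_precision(cand, ref, 1))  # unigram precision: 4/6
print(ngram_precision(cand, ref, 2))  # bigram precision: 2/5
```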
However, it assumes that a professional human translation is the ideal translation, it only evaluates a model on select sentences, and it is not very robust to different phrasing or synonyms. This is why a high BLEU score (>0.7) is usually a sign of overfitting.
Regardless, an increase in BLEU score (represented as a fraction) has been shown to accompany an increase in language-modelling power, as demonstrated below:
Building on the developments of GNMT, Google launched an extension that could perform visual, real-time translation of foreign text. One network identifies potential letters, which are fed into a convolutional neural network for character recognition. The recognized words are then fed into GNMT for translation, and the result is rendered in the same font and style as the original.
One can only imagine the difficulties that abound in creating such a service: identifying individual letters, piecing together words, determining the size and font of the text, and properly rendering the translated image.
GNMT appears in many other applications, sometimes with a different architecture. Fundamentally, however, GNMT represents a milestone in NLP: a lightweight yet effective design, built upon years of NLP breakthroughs, made incredibly accessible to everyone.