deephub translation team: Calab
FLORIZEL:Should she kneel be?In shall not weep received; unleased meAnd unrespective greeting than dwell in, thee,look'd on me, son in heavenly properly.
Who wrote this, Shakespeare or a machine learning model?
The answer is the latter! The passage above was produced by a recurrent neural network trained with TensorFlow for 30 epochs and seeded with the string "FLORIZEL:". In this article, I'll explain and walk through the code to train a neural network to write Shakespearean plays, or anything else you'd like it to write!
Imports and Data
First, import some basic libraries:
import tensorflow as tf
import numpy as np
import os
import time
TensorFlow has Shakespeare's works built in. If you're working in an online environment like Kaggle, make sure you're connected to the internet.
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
The data needs to be decoded with UTF-8.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))
[Output]:
Length of text: 1115394 characters
That's plenty of data to work with!
Let's look at the first 250 characters:
print(text[:250])
Vectorization
First, let's see how many distinct characters the file contains:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))
65 unique characters
Before training, the strings need to be mapped to a numerical representation. Below, two lookup tables are created: one that maps characters to numbers and one that maps numbers back to characters.
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
Let's take a look at the vocabulary dictionary:
print('{')
for char,_ in zip(char2idx, range(20)):
    print(' {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print(' ...\n}')
[Output]:
{
'\n': 0,
' ' : 1,
'!' : 2,
'$' : 3,
'&' : 4,
"'" : 5,
',' : 6,
'-' : 7,
'.' : 8,
'3' : 9,
':' : 10,
...
}
Every distinct character now has its own number.
Let's see how the vectorizer handles the first two words of the text, 'First Citizen':
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))
These words are converted into a vector of numbers, which can easily be converted back to text using the integer-to-character dictionary.
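As a quick illustration (not part of the original walkthrough), the mapping can be reversed with the idx2char array:
print(''.join(idx2char[text_as_int[:13]]))  # should print 'First Citizen'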
Creating the Training Data
Given a sequence of characters, the model should ideally find the most likely next character. The text will be split into sequences, each input sequence containing seq_length characters from the text. The target for any input sequence is that same sequence, shifted one character to the right.
For example, given the input "Hell", the target would be "ello", together forming the word "Hello".
First, we can use TensorFlow's .from_tensor_slices function to convert the text vector into a stream of character indices.
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
for i in char_dataset.take(5):
    print(idx2char[i.numpy()])
F
i
r
s
t
The batch method lets us group these individual characters into sequences of a fixed size, forming chunks of the text.
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
For each sequence, we duplicate and shift it with the map method to form an input and a target.
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
The dataset now contains the input and target pairs we want.
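The output below is presumably produced by taking one example from the dataset; a minimal sketch of the missing lines (which also define input_example and target_example, used again further down) would be:
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))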
Input data: 'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
Each index of these vectors is processed as a single time step: for the input at time step 0, the model receives the numerical index of "F" and tries to predict "i" as the next character. At the next time step it does the same thing, but the RNN considers the context of the previous steps in addition to the current input character.
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print(" input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print(" expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
Step 0
input: 18 ('F')
expected output: 47 ('i')
Step 1
input: 47 ('i')
expected output: 56 ('r')
Step 2
input: 56 ('r')
expected output: 57 ('s')
Step 3
input: 57 ('s')
expected output: 58 ('t')
Step 4
input: 58 ('t')
expected output: 1 (' ')
TensorFlow's tf.data can be used to split the text into more manageable sequences. But first, the data needs to be shuffled and packed into batches.
# Batch size
BATCH_SIZE = 64
# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset
<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
Building the Model
Finally, we can build the model. Let's first define some important variables:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
The model will have an embedding (input) layer that maps each character's numerical index to a vector of embedding_dim dimensions. It will have a GRU layer (which could be replaced with an LSTM layer) of size units = rnn_units. Finally, the output layer is a standard fully connected layer with vocab_size outputs.
The function below helps us create the model quickly and cleanly.
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
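As noted above, the GRU can be swapped for an LSTM. A minimal sketch of that variant, using the standard Keras LSTM layer (the name build_lstm_model is just for illustration, not from the original):
def build_lstm_model(vocab_size, embedding_dim, rnn_units, batch_size):
    # Identical to build_model, except the recurrent layer is an LSTM instead of a GRU
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model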
Assemble the model architecture by calling build_model:
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
Let's summarize the model to see how many parameters it has.
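The summary call itself is omitted in the original; the output below is presumably produced by:
model.summary()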
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (64, None, 256) 16640
gru (GRU) (64, None, 1024) 3938304
dense (Dense) (64, None, 65) 66625
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
Four million parameters! We'll want to train it for a while.
Compiling the Model
The problem can now be treated as a classification problem: given the previous RNN state and the input at this time step, predict the class representing the next character. We therefore attach a sparse categorical cross-entropy loss function and the Adam optimizer.
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Run one batch through the untrained model so the example loss below can be computed
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss: ", example_batch_loss.numpy().mean())

model.compile(optimizer='adam', loss=loss)
Prediction shape: (64, 100, 65) # (batch_size, sequence_length, vocab_size)
scalar_loss: 4.1746616
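As a sanity check (an observation added here, not in the original): the loss of an untrained model should be close to the natural log of the vocabulary size, since its predictions start out roughly uniform over the 65 characters.
print(np.log(65))  # ~4.174, which matches the scalar_loss printed above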
Configuring Checkpoints
Model training, especially on a large dataset like Shakespeare's plays, takes a long time. Ideally, we don't want to retrain the model every time we make a prediction. The tf.keras.callbacks.ModelCheckpoint callback saves the weights at checkpoints during training to files, from which they can later be loaded into a fresh model. This is also handy if training is interrupted for any reason.
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
Finally, run the training:
EPOCHS=30
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
This should take about six hours. For less impressive but faster results, the number of epochs can be reduced to 10 (anything under 5 will produce complete garbage).
Generating Text
Restore the weights from the latest checkpoint:
tf.train.latest_checkpoint(checkpoint_dir)
With these weights we can rebuild the model, this time with a batch size of 1:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
The steps for generating text are:
1. Choose a seed string, initialize the RNN state, and set the number of characters to generate.
2. Get the prediction distribution for the next character using the start string and the RNN state.
3. Use a categorical distribution to compute the index of the predicted character, and use it as the next input to the model.
4. The RNN state returned by the model is fed back into the model.
5. Repeat steps 2 to 4 until the text is generated.

def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
Finally, given a start string, we can generate some interesting text.
Now, enjoy two plays written by the RNN: one trained for 10 epochs and one trained for 30.
Here is the 10-epoch model's output:
print(generate_text(model, start_string=u"ROMEO: "))
ROMEO: how I, away too put That you shall have thieffort, are but love.
JULIET: Go, fight, sir: we say 'Ay,' and alack to stand and not to go to; And washt us him to-domm. Ay, my ows young; a man hear from his monsher to thee.
KING RICHARD III: Come, cease. O broteld the costime’s deforment! Thou wilt was quite.
PAULINA: I would you say the hour! Ah, hole for your company: But, good my lord; we have a king, of peace?
BALTHASAR: Cadul and washee could he ha! To curit her I may wench.
GLOUCESTER: Had you here shall such a pierce to temper; Or might his noble offery owe and speed Which seemest thy trims in a weaky amidude By this to the dother, dods citizens.
Third Citizen:Madam sweet give reward, rebeire them With news gone! Pluck yielding: ’tis sign out things Within risess in strifes all ten times, To dish his finmers for briefily.
JULIET:Gentlemen, God eveI come approbouting his wife as it, — triumphrous night change you gods, thou goest:To which will dispersed and France.
Wow! After only 10 epochs, the grasp of language is impressive. The spelling is questionable, but there is clear dramatic conflict. The writing can certainly be improved; hopefully the 30-epoch model does better.
Here is the 30-epoch model's output.
Enjoy a work created entirely by the RNN, character by character!
BRUTUS:Could you be atherveshed him, our two,But much a tale lendly fear;For which we in thy shade of Naples.Here’s no increase False to’t, offorit is the war of white give again.This is the queen, whose vanoar’s head is worthly.But cere it be a witch, some comfort.What, nurse, I say!Go Hamell.
FLORIZEL:Should she kneel be?In shall not weep received; unleased meAnd unrespective greeting than dwell in, thee,look'd on me, son in heavenly properly,That ever you are my father is but straing;Unless you would repossess him, hath always louded up,You provokest. Good faith, o'erlar I can repart the heavens like deeds dillsFor temper as soon as another maiden here, and he is bann'd upon which springs;O'er most upon your voysus, I have no thunder; and my good villain!Alest each other's sleepings.A fool; if this business prating dutyDoes these traitors other sorrow.
LUCENTIO:Tell me, they’s honourably.
Shepherd:I know, my lord, to London, and you my moved join under him,Great Apollo's stan to make a book,Both yet my father away towards Covent. Tut, And thou still'd by the earthmen lord r sensible your mother?
Servant:Go, vill! We muster yet, for you'll not: you are took good mad within your company in rage, I would you fight it so, his eye for every days,To swear the beam of such a detects,To Clarence dead to call upon you all I thank your grace, my father and my father, and yourself prevailsMy father, hath a sword for hither;Nor when thy heart is grown grave done.
QUEEN MARGARET: Thou art a lodging very good and give thanksWith him.But There is now in hand:Therefore it be possish'd with Romeo dead.
MENENIUS:Ha! little very welcome to my daughter's sword,Which haply my prayer's legs, such as he does.I am banks, sir, I'll make you say 'nough; for hither so better now to be so, sent it: it is stranger.
Wow! Interestingly, the model has even learned to rhyme in some cases (especially in Florizel's lines). Imagine what the RNN could write after 50 or even 100 epochs!
So, I guess AI will put writers out of work?
Not exactly. But I can imagine a future where AI publishes plenty of articles engineered to go viral. Here's a challenge: collect top articles on a topic, say from Human Parts or a similar publication, then train the AI to write popular pieces. Publish the RNN's output verbatim and see how it does! A caveat: I wouldn't recommend training an RNN on more specialized publications like Towards Data Science or Better Programming, because that requires technical knowledge the RNN cannot learn in a reasonable amount of time. More philosophical and non-technical writing, however, is within the RNN's current abilities.
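For reference, a minimal sketch of how you might point the same pipeline at your own corpus; my_articles.txt is a hypothetical file you would assemble yourself, and everything downstream of the text variable stays the same:
# Hypothetical corpus of articles you have collected; swap in your own file
text = open('my_articles.txt', 'rb').read().decode(encoding='utf-8')
# From here, the vocabulary, dataset, and model-building steps above apply unchanged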
As text generation becomes more advanced, it has the potential to write better than humans, because it will have an eye for which content goes viral, which phrasing makes readers feel good, and so on. It's striking that one day machines may beat humans at the thing we do best: writing. Granted, they won't truly understand what they are writing, but they will master the way humans communicate.
I suppose if you can't beat them, join them!