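A Detailed Walkthrough: Deep Reinforcement Learning Showcases the New Features of TensorFlow 2.0 (with Code)

2020-12-16 新智元

Report by 新智元

Source: Reddit

Author: Roman Ring

Editors: 三石, 肖琴

[新智元 Editor's Note] Ever since the TensorFlow team announced the new features of version 2.0, many people have probably been somewhat confused about them. Blogger Roman Ring therefore wrote an overview article that demonstrates TensorFlow 2.0's features concretely by implementing a deep reinforcement learning algorithm.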

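As the saying goes, true knowledge comes from practice.

The features of TensorFlow 2.0 were announced some time ago, but many people are probably still unclear about them.

In this tutorial, the author showcases the upcoming TensorFlow 2.0 features through deep reinforcement learning (DRL), specifically by implementing an advantage actor-critic (A2C) agent to solve the classic CartPole-v0 environment.

Although the goal of the article is to showcase TensorFlow 2.0, the author first covers the DRL side, including a brief overview of the field.

In fact, since the main focus of the 2.0 release is making developers' lives easier, that is, ease of use, now is a good time to get into DRL with TensorFlow.

The full code for this article is available at: GitHub: https://github.com/inoryy/tensorflow2-deep-reinforcement-learning

Google Colab: https://colab.research.google.com/drive/12QvW7VZSzoaF-Org-u-N6aiTdBN5ohNA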

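Installation

Since TensorFlow 2.0 is still experimental, it is recommended to install it in a separate (virtual) environment. I prefer Anaconda, so I will use it for illustration:

> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview # tf-nightly-gpu-2.0-preview for GPU version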

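Let's quickly verify that everything works as expected:

>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.0-dev20190117
>>> print(tf.executing_eagerly())
True

Don't worry about the 1.13.x version; this is just an early preview. The thing to note here is that we are in eager mode by default!

>>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
tf.Tensor(15, shape=(), dtype=int32)

If you are not familiar with eager mode, then in short it means that computation is executed at runtime rather than through a pre-compiled graph. You can read more about it in the TensorFlow documentation:

https://www.tensorflow.org/tutorials/eager/eager_basics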
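To make the eager-versus-graph distinction concrete without depending on any particular TensorFlow build, here is a framework-free toy sketch. The `Const` and `Add` classes are hypothetical illustrations and not part of TensorFlow; they only mimic the idea of first describing a computation and running it later.

```python
class Const:
    """Leaf node of the toy graph: holds a concrete value."""

    def __init__(self, value):
        self.value = value

    def run(self):
        return self.value


class Add:
    """Deferred addition: nothing is computed until run() is called."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def run(self):
        return self.left.run() + self.right.run()


def build_sum_graph(values):
    """Describe a sum as a chain of Add nodes without evaluating it."""
    node = Const(0)
    for v in values:
        node = Add(node, Const(v))
    return node


# "Eager" style: the result is available as soon as the line executes.
eager_result = sum([1, 2, 3, 4, 5])

# "Graph" style: build a description first, then execute it explicitly.
graph = build_sum_graph([1, 2, 3, 4, 5])
graph_result = graph.run()

print(eager_result, graph_result)
```

Both paths produce the same number; the difference lies only in when the arithmetic happens.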

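Deep Reinforcement Learning

Generally speaking, reinforcement learning is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, and receives rewards as a result. Most RL algorithms work by maximizing the sum of rewards an agent collects over a trajectory.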
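The observe, act, and collect-reward cycle can be sketched in a few lines. `CoinFlipEnv` is a hypothetical toy environment invented for this illustration (it is not a Gym environment), and `always_zero` is an arbitrary fixed policy.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # observation: the current step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, reward, done


def always_zero(obs):
    """A fixed policy that ignores the observation."""
    return 0


def run_episode(env, policy):
    """Run one trajectory and return the sum of collected rewards."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


episode_return = run_episode(CoinFlipEnv(), always_zero)
print("sum of rewards over the trajectory:", episode_return)
```

The quantity returned by `run_episode` is what most RL algorithms try to maximize.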

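The output of an RL-based algorithm is typically a policy: a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action. A stochastic policy is represented as a conditional probability distribution over actions given a state.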
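The two kinds of policy mentioned above might look like this in plain Python. The hand-written logits in `stochastic_policy` are an assumption made for illustration; a real agent would compute them with a learned model.

```python
import math
import random


def noop_policy(state):
    """Deterministic policy: ignores the state and always returns action 0."""
    return 0


def softmax(logits):
    """Turn unnormalized log probabilities into a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_policy(state, rng):
    """Stochastic policy: sample an action from pi(action | state)."""
    # Hypothetical hand-written logits that depend on a scalar state.
    logits = [state, -state]
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


rng = random.Random(0)
action, probs = stochastic_policy(0.5, rng)
print(noop_policy(0.5), action, probs)
```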

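Actor-Critic Methods

RL algorithms are often grouped by the objective function they optimize. Value-based methods (such as DQN) work by reducing the error of the expected state-action values.

Policy gradient methods directly optimize the policy itself by adjusting its parameters, typically via gradient descent. Computing the gradient exactly is usually intractable, so it is typically estimated with Monte Carlo methods.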
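As a rough sketch of Monte Carlo gradient estimation, consider a hypothetical two-armed bandit with a softmax policy. This setup is small enough that the exact gradient of the expected reward can be written down and compared against the sampled estimate. All names below are illustrative and do not come from the original article.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(theta, rewards):
    """Closed form for a softmax bandit: dJ/dtheta_k = pi_k * (r_k - E[r])."""
    probs = softmax(theta)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def monte_carlo_gradient(theta, rewards, num_samples, rng):
    """Score-function estimate: average of r(a) * grad log pi(a) over sampled actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        for k in range(len(theta)):
            # d/dtheta_k log pi(a) = 1[a == k] - pi_k for a softmax policy
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += rewards[a] * score / num_samples
    return grad


theta = [0.0, 0.0]          # policy parameters (logits)
arm_rewards = [1.0, 0.0]    # hypothetical deterministic reward per arm
estimate = monte_carlo_gradient(theta, arm_rewards, 20000, random.Random(0))
exact = exact_gradient(theta, arm_rewards)
print(estimate, exact)
```

In realistic environments the closed form is unavailable, which is why the sampled estimate is what gets used in practice.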

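The most popular approach is a hybrid of the two: actor-critic methods, in which the agent's policy is optimized through policy gradients, while a value-based method serves as a bootstrap for the expected value estimates.

Deep Actor-Critic Methods

While much of the foundational RL theory was developed for the tabular case, modern RL is almost exclusively done with function approximators such as artificial neural networks. Specifically, an RL algorithm is considered "deep" if the policy and value functions are approximated with deep neural networks.

Asynchronous Advantage Actor-Critic

Over the years, a number of improvements have been made to address the sample efficiency and the stability of the learning process.

First, gradients are weighted with returns: discounted future rewards. This somewhat alleviates the credit assignment problem and resolves the theoretical issues that arise with infinite timesteps.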
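A discounted return can be computed with a single backward pass over the rewards. The function below is a minimal sketch; the reward values and the discount factor `gamma=0.5` are chosen only to keep the arithmetic easy to follow.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

Because every later reward is multiplied by another factor of gamma, the sum stays finite even for very long trajectories as long as gamma is below 1.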

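Second, an advantage function is used instead of raw returns. The advantage is the difference between the return and some baseline (for example, a state-action estimate), and can be thought of as a measure of how good a given action is compared to some average.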
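In code, the advantage is just an element-wise difference. The baseline values below are hypothetical critic outputs picked for illustration.

```python
def advantages(returns, baselines):
    """Advantage = return minus baseline, computed per time step."""
    return [g - b for g, b in zip(returns, baselines)]


sample_returns = [1.75, 1.5, 1.0]
value_estimates = [1.5, 1.5, 1.5]  # hypothetical baseline predictions
advs = advantages(sample_returns, value_estimates)
print(advs)  # [0.25, 0.0, -0.5]
```

A positive entry means the outcome was better than the baseline expected, a negative entry means it was worse.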

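Third, an additional entropy maximization term is used in the objective function to ensure that the agent sufficiently explores various policies. In essence, entropy measures the randomness of a probability distribution and is maximized by the uniform distribution.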
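The claim that the uniform distribution maximizes entropy is easy to check numerically. This sketch uses the natural logarithm; the example distributions are arbitrary.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(entropy(uniform))  # equals log(4), the maximum for four outcomes
print(entropy(peaked))   # much smaller: the distribution is nearly deterministic
```

Adding an entropy bonus to the objective therefore pushes the policy away from collapsing onto a single action too early.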

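Finally, multiple workers are used in parallel to speed up sample gathering, which also helps decorrelate the samples during training.

Combining all of these changes with deep neural networks, we arrive at two of the most popular modern algorithms: the (asynchronous) advantage actor-critic algorithms, or A3C/A2C for short. The difference between the two is more technical than theoretical: as the name suggests, it boils down to how the parallel workers estimate their gradients and propagate them to the model.

With this I will wrap up our tour of DRL methods, since the focus of this blog post is the TensorFlow 2.0 features. Don't worry if you are still unsure about the subject; things should become clearer with the code examples.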

Implementing Advantage Actor-Critic with TensorFlow 2.0

Let's see what it takes to implement the basis of many modern DRL algorithms: the actor-critic agent described in the previous section. For simplicity, we won't implement parallel workers, although most of the code supports them. Interested readers can treat this as an exercise.

As a testbed we will use the CartPole-v0 environment. Although somewhat simplistic, it is still a good choice.

Policy and Value via the Keras Model API

First, let's create the policy and value estimate neural networks under a single model class:

import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl


class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)


class Model(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__('mlp_policy')
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation='relu')
        self.hidden2 = kl.Dense(128, activation='relu')
        self.value = kl.Dense(1, name='value')
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name='policy_logits')
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

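Then verify that the model works as expected:

import gym

env = gym.make('CartPole-v0')

model = Model(num_actions=env.action_space.n)

obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value) # [1] [-0.00145713]

Things to note here:

- Model layers and the execution path are defined separately
- There is no "input" layer; the model accepts raw numpy arrays
- Two computation paths can be defined in one model via the functional API
- A model can contain helper methods such as action sampling
- In eager mode, everything works from raw numpy arrays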

Random Agent

Now let's move on to the A2CAgent class. First, let's add a test method that runs through a full episode and returns the sum of rewards.

class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward

Let's see how the model scores with randomly initialized weights:

agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum) # 18 out of 200

That is still far from optimal, so on to the training part!

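Loss / Objective Function

As described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function. In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients plus entropy maximization, and minimizing the value estimate error.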
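To see how the three objectives combine, here is a framework-free sketch of the loss for a single transition. It is an illustration written for this walkthrough, not the Keras implementation; the default coefficients 0.5 and 0.0001 match the hyperparameters used in the article's code, and everything else (function names, the sample inputs) is assumed.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


def a2c_loss(logits, action, advantage, value, ret,
             value_coef=0.5, entropy_coef=0.0001):
    """Combined single-sample loss: policy term, entropy bonus, value error."""
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    # Policy term: negative log-probability of the taken action, weighted by advantage.
    policy_loss = -advantage * log_probs[action]
    # Entropy of the action distribution; subtracted because we minimize the loss.
    entropy = -sum(p * lp for p, lp in zip(probs, log_probs))
    # Value term: squared error between the return and the critic's estimate.
    value_loss = (ret - value) ** 2
    return policy_loss - entropy_coef * entropy + value_coef * value_loss


loss_uniform = a2c_loss([0.0, 0.0], action=0, advantage=1.0, value=0.0, ret=1.0)
loss_better = a2c_loss([1.0, 0.0], action=0, advantage=1.0, value=0.0, ret=1.0)
print(loss_uniform, loss_better)
```

With a positive advantage, raising the logit of the taken action lowers the loss, which is the direction the optimizer will move the policy.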

import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko


class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(lr=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params['value']*kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # polymorphic CE loss function that supports sparse and weighted options
        # from_logits argument ensures transformation into normalized probabilities
        cross_entropy = kls.CategoricalCrossentropy(from_logits=True)
        # policy loss is defined by policy gradients, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        # thus under the hood a sparse version of CE loss will be executed
        actions = tf.cast(actions, tf.int32)
        policy_loss = cross_entropy(actions, logits, sample_weight=advantages)
        # entropy loss can be calculated via CE over itself
        entropy_loss = cross_entropy(logits, logits)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params['entropy']*entropy_loss

And we are done with the objective functions! Note how compact the code is: there are almost more comment lines than lines of code.

Agent Training Loop

Finally, there is the training loop itself. It is a bit long but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
        # unchanged from previous section
        ...

    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])

                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()

            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params['gamma'] * returns[t+1] * (1-dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
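The bootstrapped return computation is the least obvious part of the loop, so here is a worked example in plain Python lists that mirrors the same recurrence. The input numbers and `gamma=0.5` are made up to keep the arithmetic simple; note how the `done` flag at step 1 stops the return from leaking across the episode boundary.

```python
def returns_and_advantages(rewards, dones, values, next_value, gamma=0.99):
    """Discounted returns with a bootstrap value and episode-boundary masking."""
    # The extra last slot holds the critic's estimate for the state after the batch.
    returns = [0.0] * len(rewards) + [next_value]
    for t in reversed(range(len(rewards))):
        # (1 - dones[t]) zeroes the future term when the episode ended at step t.
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1]
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages


rewards = [1.0, 1.0, 1.0]
dones = [0, 1, 0]          # the episode ends after the second step
values = [0.5, 0.5, 0.5]   # hypothetical critic estimates for each step
rets, advs = returns_and_advantages(rewards, dones, values, next_value=2.0, gamma=0.5)
print(rets)  # [1.5, 1.0, 2.0]
print(advs)  # [1.0, 0.5, 1.5]
```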

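Training & Results

We are now ready to train this single-worker A2C agent on CartPole-v0! The training process should take only a few minutes. Once training is complete, you should see the agent successfully hit the target score of 200.

rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env)) # 200 out of 200

In the source code I include some additional helpers that print out the rewards and losses of running episodes, as well as rewards_history.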

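Static Computational Graph

With eager mode working this well, you might wonder whether static graph execution is still possible. Of course it is! Moreover, it takes just one additional line of code to enable it.

with tf.Graph().as_default():
    print(tf.executing_eagerly()) # False

    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)

    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env)) # 200 out of 200

One caveat is that during static graph execution we can't just have Tensors lying around, which is why we needed the ProbabilityDistribution trick during model definition.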

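One More Thing…

Remember when I said TensorFlow runs in eager mode by default, and even proved it with a code snippet? Well, I lied.

If you use the Keras API to build and manage your models, it will attempt to compile them into static graphs under the hood. So what you end up with is the performance of static computational graphs with the flexibility of eager execution.

You can check a model's status via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time that is probably unnecessary: if Keras detects that there is no way around eager mode, it will back off on its own.

To illustrate that it is indeed running as a static graph, here is a simple benchmark:

# create a 100000 samples batch
env = gym.make('CartPole-v0')
obs = np.repeat(env.reset()[None, :], 100000, axis=0)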

Eager Benchmark

%%time

model = Model(env.action_space.n)
model.run_eagerly = True

print("Eager Execution: ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model(obs)

######## Results #######

Eager Execution: True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s

Static Benchmark

%%time

with tf.Graph().as_default():
    model = Model(env.action_space.n)

    print("Eager Execution: ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)

    _ = model.predict(obs)

######## Results #######

Eager Execution: False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms

Default Benchmark

%%time

model = Model(env.action_space.n)

print("Eager Execution: ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)

_ = model.predict(obs)

######## Results #######

Eager Execution: True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s

As you can see, eager mode lags behind static mode, and by default the model is indeed executed statically.

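Conclusion

Hopefully this article has been helpful in understanding both DRL and the upcoming TensorFlow 2.0. Note that TensorFlow 2.0 is still only a preview and everything is subject to change; if there is something about TensorFlow you especially dislike (or like :)), let the developers know.

A recurring question is whether TensorFlow is better than PyTorch. Maybe, maybe not. Both are great libraries, so it is hard to say which one is better. If you are familiar with PyTorch, you may have noticed that TensorFlow 2.0 has not only caught up but also avoided some of the pitfalls of the PyTorch API.

Whoever comes out ahead, the competition has been a net positive for developers on both sides, and I am excited to see what these frameworks will become in the future.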
