Deep Reinforcement Learning Notes - Policy-Based Method 3 - Proximal Policy Optimization

2021-03-02 波哥咋也機器學習


In this new series of machine learning notes, I start taking notes on Deep Reinforcement Learning, in English.

In this series, I will start with some of the 'recent' concepts in Deep Reinforcement Learning. That means we will not start with the very basic or 'ancient' approaches to reinforcement learning, but with the exciting 'recent' approaches that catch many people's eyes. This does not mean that the 'ancient' part is unimportant; if necessary, we will cover it gradually in future articles.

I choose to write in English mainly because this thriving domain of Machine Learning currently gets much more attention in English-speaking countries, and many discussions and materials are in English. It is therefore useful to get used to discussing the related concepts in English.

Intro

In the previous blog post, we summarized the idea of the REINFORCE algorithm, in which we update the policy weights using the gradient of the expected reward.

However, this most basic policy gradient method has several problems; below we show what they are and which important tweaks address them.

Some of these important tweaks lead us to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). These tweaks allow faster and more stable learning. PPO is now the default benchmark algorithm used by OpenAI, and many of its ideas are representative of how state-of-the-art DRL algorithms deal with these problems.


Beyond REINFORCE (Monte Carlo)

Let's first revisit the key ingredients of the REINFORCE algorithm:

First, we initialize a random policy $\pi_\theta(a|s)$ with random weights $\theta$, and use it to collect a trajectory $(s_0, a_0, r_1, s_1, a_1, r_2, \dots)$.

Second, we compute the total reward of the trajectory, $R = r_1 + r_2 + \dots + r_H$.

Third, we update our policy weights using gradient ascent with learning rate $\alpha$: $\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$.

The process then repeats; a minimal code sketch of one iteration is given below.
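Putting the three steps together, here is a minimal sketch of one REINFORCE iteration in PyTorch. The function name and arguments are illustrative only (not from the original post); it assumes the log-probabilities were collected while running the policy.

import torch

def reinforce_update(policy, optimizer, log_probs, total_reward):
    # log_probs:    list of log pi_theta(a_t|s_t) tensors collected while running the policy
    # total_reward: total reward R of the trajectory (a float)
    loss = -torch.stack(log_probs).sum() * total_reward   # negate: gradient ascent on U(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()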

There are three main problems with REINFORCE:

The update process is very inefficient: we run the policy once, update once, and then throw the trajectory away.

The gradient estimate is noisy: by chance, the single collected trajectory may not be representative of the policy.

There is no clear credit assignment: a trajectory may contain many good and bad actions, and whether these actions are reinforced depends only on the final total reward.

PPO proposes solutions to all of these problems, and we will cover them one by one below.

Noise Reduction

In the REINFORCE algorithm, we optimize the policy by maximizing the expected reward $U(\theta) = \sum_\tau P(\tau;\theta)\, R(\tau)$.

Instead of using the millions (or, in principle, infinitely many) trajectories implied by this expectation, we simply take one trajectory to compute the gradient and update our policy.

This shortcut makes our update come down to chance: sometimes the single collected trajectory simply does not contain useful information. The hope is that, after training for a long time, the tiny signal accumulates.

The easiest option to reduce the noise in the gradient is to simply sample more trajectories. Using distributed computing, we can collect multiple trajectories in parallel. Then we can estimate the policy gradient by averaging across all the different trajectories:

$g = \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(i)}\big|s_t^{(i)}\big)\, R_i$
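As a rough sketch (not the author's original code), the averaged gradient can be obtained by averaging the per-trajectory objective before calling backward(); the tensor shapes here are assumptions.

import torch

def batched_policy_gradient_loss(log_probs, total_rewards):
    # log_probs:     tensor of shape (T, N) with log pi_theta(a_t|s_t) per step and trajectory
    # total_rewards: tensor of shape (N,) with the total reward R_i of each trajectory
    per_trajectory = log_probs.sum(dim=0) * total_rewards   # sum over time, weight by R_i
    return -per_trajectory.mean()                           # average over N trajectories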

Rewards Normalization

There is another bonus to running multiple trajectories: we can collect all the total rewards and get a sense of how they are distributed.

In many cases, the distribution of rewards shifts as learning happens. A reward of 1 might be really good at the beginning, but really bad after 1000 training episodes.

Learning can be improved if we normalize the rewards,

$R_i \leftarrow \frac{R_i - \mu}{\sigma}$

where $\mu = \frac{1}{N}\sum_{i=1}^{N} R_i$ is the mean and $\sigma = \sqrt{\frac{1}{N}\sum_{i}(R_i - \mu)^2}$ is the standard deviation of the collected total rewards.

Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage and half to discourage, while also making sure the steps for gradient ascent are not too large or too small.
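A minimal sketch of this normalization step, mirroring what the clipped_surrogate function at the end of this post does; the (T, N) array shape (one column per parallel trajectory) is an assumption.

import numpy as np

def normalize_rewards(rewards):
    # rewards: array of shape (T, N), one column per parallel trajectory
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-10   # avoid division by zero
    return (rewards - mean) / std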

Credit Assignment

Going back to the gradient estimate, we can take a closer look at the total reward $R$, which is just a sum of the rewards at each step: $R = r_1 + r_2 + \dots + r_H$.

Let's think about what happens at time-step $t$. Even before the action $a_t$ is decided, the agent has already received all the rewards up until step $t-1$. We can call this part the reward from the past, $R_t^{\text{past}} = r_1 + r_2 + \dots + r_{t-1}$, and the rest the future reward, $R_t^{\text{future}} = r_t + r_{t+1} + \dots + r_H$.

Because we have a Markov process, the action at time-step $t$ can only affect the future reward, so the past reward shouldn't contribute to the policy gradient. To properly assign credit to the action $a_t$, we should ignore the past reward and use only the future reward as the coefficient:

$g = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t^{\text{future}}$

Notes on Gradient Modification

It turns out that mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn't change the averaged gradient. So even though the gradient is different during training, on average we are still maximizing the average reward. In fact, the resultant gradient is less noisy. So training using future rewards should speed things up.
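A small sketch of computing the (discounted) future rewards with a reversed cumulative sum, the same trick used in the clipped_surrogate code at the end of this post; the (T, N) shape is an assumption.

import numpy as np

def discounted_future_rewards(rewards, discount=0.995):
    # rewards: array of shape (T, N); returns R_t^future for every step and trajectory
    discounts = discount ** np.arange(len(rewards))
    discounted = np.asarray(rewards) * discounts[:, np.newaxis]
    # reverse in time, cumulative-sum, reverse back: sum of rewards from step t onwards
    return discounted[::-1].cumsum(axis=0)[::-1]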

Importance Sampling

In the REINFORCE algorithm, we start with a policy $\pi_\theta$, and then, using that policy, we generate a trajectory (or multiple ones to reduce noise). With it, we compute the policy gradient and update the weights: $\theta \rightarrow \theta'$.

At this point, the trajectories we've just generated are simply thrown away.  If we want to update our policy again, we would need to generate new trajectories once more, using the updated policy.

In fact, we need to compute the gradient for the current policy, and to do that the trajectories need to be representative of the current policy.

Ideally, we would like to reuse these recycled trajectories to compute gradients and update the policy, again and again.

However, using recycled trajectories (off-policy data) is problematic, because the old samples follow a distribution that is somewhat different from the trajectories the current policy would generate. Hence we need some kind of sampling technique to ease the pain, and this is where importance sampling comes in. Consider the trajectories collected by the old policy $\pi_\theta$: each trajectory $\tau$ had a probability $P(\tau;\theta)$ of being sampled, whereas under the new policy $\pi_{\theta'}$ the same trajectory would be sampled with a different probability $P(\tau;\theta')$.

Imagine we want to compute the average of some quantity under the new policy, say $f(\tau)$. Mathematically, this means adding up all the $f(\tau)$, weighted by the probability of sampling each trajectory under the new policy:

$\mathbb{E}_{\tau \sim \pi_{\theta'}}\big[f(\tau)\big] = \sum_\tau P(\tau;\theta')\, f(\tau)$

Now we can rearrange this equation, by multiplying and dividing by the same number, $P(\tau;\theta)$, and regrouping the terms:

$\sum_\tau P(\tau;\theta')\, f(\tau) = \sum_\tau P(\tau;\theta)\, \frac{P(\tau;\theta')}{P(\tau;\theta)}\, f(\tau)$

Written in this way, we can reinterpret the factor $P(\tau;\theta)$ as the probability of sampling the trajectory under the old policy, with an extra re-weighting factor $P(\tau;\theta')/P(\tau;\theta)$ applied in addition to just averaging.

(figure from [2])

Intuitively, this tells us we can use old trajectories to compute averages for a new policy, as long as we add this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one.

The re-weighting factor

Let's take a closer look at the re-weighting factor:

$\frac{P(\tau;\theta')}{P(\tau;\theta)} = \frac{\pi_{\theta'}(a_1|s_1)\,\pi_{\theta'}(a_2|s_2)\,\pi_{\theta'}(a_3|s_3)\cdots}{\pi_{\theta}(a_1|s_1)\,\pi_{\theta}(a_2|s_2)\,\pi_{\theta}(a_3|s_3)\cdots}$

Each probability here is a chain of products of the policy evaluated at every time-step, as the equation above shows.

If we estimate averages with this factor, it can easily become numerically unstable. For instance, when some of the policy probabilities get close to 0, the re-weighting factor can become close to zero, or worse, blow up towards infinity. This makes the re-weighting trick unreliable.
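To see why this matters numerically, here is a toy example (the numbers are made up, not from the original post) showing how a trajectory-level product of per-step ratios explodes or vanishes as the trajectory gets long:

import numpy as np

rng = np.random.default_rng(0)
T = 1000
old_probs = rng.uniform(0.3, 0.7, size=T)                 # hypothetical per-step probabilities
new_probs = old_probs * rng.uniform(0.9, 1.15, size=T)    # a slightly different policy

# trajectory-level re-weighting factor: a product of T per-step ratios
log_ratio = np.sum(np.log(new_probs) - np.log(old_probs))
print(np.exp(log_ratio))   # the product typically explodes or vanishes exponentially in T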

To cope with this problem, PPO introduces a trick based on a surrogate function.

The Surrogate Function (Proximal Policy)

Let's take a look at our policy gradient again. Re-writing the derivative of the log term, we have

$\nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t) = \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$

Then, by re-arranging these equations, we replace the log-derivative inside the re-weighted gradient:

$g = \sum_t \frac{P(\tau;\theta')}{P(\tau;\theta)}\, \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}\, R_t^{\text{future}}$

Now, the idea of the proximal policy comes in. The ratio $P(\tau;\theta')/P(\tau;\theta)$ expands into a long chain of per-step factors. If the old and the current policy are close enough to each other, we can treat all the factors in that chain, except the one at time-step $t$, as numbers very close to 1, leaving only the terms shown below:

$g \approx \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}$

In other words, we approximate the re-weighting factor by the ratio of the two policies evaluated at the same state-action pair.

With this approximated gradient, we can think of it as the gradient of a new object, called the surrogate function:

$L_{\text{sur}}(\theta', \theta) = \sum_t \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}, \qquad g \approx \nabla_{\theta'} L_{\text{sur}}(\theta', \theta)$

Therefore, we aim to maximize this surrogate function using gradient ascent.
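A minimal sketch of the (un-clipped) surrogate as code, assuming the per-step probabilities and future rewards are already available as tensors; this is a simplified version of the full function given at the end of the post.

import torch

def surrogate(new_probs, old_probs, future_rewards):
    # new_probs:      pi_theta'(a_t|s_t), differentiable tensor of shape (T, N)
    # old_probs:      pi_theta(a_t|s_t), fixed tensor of the same shape
    # future_rewards: R_t^future, tensor of the same shape
    ratio = new_probs / old_probs           # per-step re-weighting factor
    return (ratio * future_rewards).mean()  # maximize this with gradient ascent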

There remains another important issue: if the two policy distributions differ too much, the assumption we made above is no longer valid. Hence there must be some constraint that keeps this from happening.

To cope with this, one solution is to introduce the KL divergence as a regularization term.

KL Divergence as regularization (TRPO and PPO)

In the original papers of TRPO (the predecessor of PPO) and PPO, despite all the complex mathematical derivations, the main idea is to add the KL divergence between the old and the new policy as a constraint on the optimization.

In PPO, the optimization objective looks like this:

$J_{\text{PPO}}(\theta') = L_{\text{sur}}(\theta', \theta) - \beta\, \mathrm{KL}\big(\pi_{\theta},\, \pi_{\theta'}\big)$

The KL divergence term acts as a differentiable regularization term, which makes the optimization process a bit easier.

(figure from [2])

In TRPO, the KL divergence is instead treated as a hard constraint: maximize $L_{\text{sur}}(\theta', \theta)$ subject to $\mathrm{KL}(\pi_\theta, \pi_{\theta'}) \le \delta$. This is mathematically more precise, but very hard to optimize.

In practice, researchers found that PPO achieves results very similar to TRPO, while being much easier to implement.
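As a hedged sketch of what the KL-penalized objective can look like in code, assuming the same binary-action (Pong-style) setup as the clipped version below, where new_probs / old_probs are the probabilities of one of the two actions; this function is illustrative, not taken from either paper.

import torch

def ppo_kl_objective(new_probs, old_probs, future_rewards, beta=0.01):
    # surrogate objective with a KL-divergence penalty (PPO with a fixed beta)
    ratio = new_probs / old_probs
    sur = (ratio * future_rewards).mean()
    # KL divergence between the old and new Bernoulli action distributions at each step;
    # 1e-10 guards against log(0)
    kl = (old_probs * torch.log(old_probs / new_probs + 1e-10)
          + (1.0 - old_probs) * torch.log((1.0 - old_probs) / (1.0 - new_probs) + 1e-10)).mean()
    return sur - beta * kl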

Clipping Policy Update (PPO-2)

In PPO-2, the KL divergence regularization is replaced with a clipped policy update.

When the two policy distributions differ too much, it is highly likely that the old policy from which the trajectories were sampled will push the current policy off a "cliff": the surrogate keeps suggesting updates even though the true expected reward has already crashed. In such a case, it can be impossible to climb out of the resulting bad-policy plateau.

Hence, an intuitive way to deal with this cliff-jumping effect is to put restrictions on the surrogate function itself.

The formula of the clipped surrogate function is

$L_{\text{sur}}^{\text{clip}}(\theta', \theta) = \sum_t \min\!\Big( \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}},\; \mathrm{clip}\Big(\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\Big) R_t^{\text{future}} \Big)$

We want to make sure the two policies stay close to each other: once the probability ratio leaves the interval $[1-\epsilon,\, 1+\epsilon]$, the clipped term becomes flat, its gradient vanishes, and the update stops pushing the new policy further away.

(figure from [2])

The whole clipped surrogate function can be implemented as follows:


import numpy as np
import torch

# `pong_utils` and `device` are helpers from the Udacity DRLND Pong exercise [1]

def clipped_surrogate(policy, old_probs, states, actions, rewards,
                      discount=0.995, epsilon=0.1, beta=0.01):

    # discounted future rewards for credit assignment
    discounts = discount ** np.arange(len(rewards))
    rewards = np.asarray(rewards) * discounts[:, np.newaxis]
    rewards_future = rewards[::-1].cumsum(axis=0)[::-1]

    # normalize rewards across the parallel trajectories
    mean = rewards_future.mean(axis=1)
    std = rewards_future.std(axis=1) + 1.e-10
    rewards_normalized = (rewards_future - mean[:, np.newaxis]) / std[:, np.newaxis]

    # convert everything into pytorch tensors and move to gpu if available
    actions = torch.tensor(actions, dtype=torch.int8, device=device)
    old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
    rewards = torch.tensor(rewards_normalized, dtype=torch.float, device=device)

    # convert states to policy (or probability)
    new_probs = pong_utils.states_to_prob(policy, states)
    new_probs = torch.where(actions == pong_utils.RIGHT, new_probs, 1.0 - new_probs)

    # probability ratio and its clipped version
    ratio = new_probs / old_probs
    clip = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

    # include a regularization term:
    # this steers new_policy towards 0.5, prevents the policy from becoming
    # exactly 0 or 1, and helps exploration;
    # add 1.e-10 to avoid log(0), which gives nan
    entropy = -(new_probs * torch.log(old_probs + 1.e-10)
                + (1.0 - new_probs) * torch.log(1.0 - old_probs + 1.e-10))

    # clipped surrogate averaged over time-steps and trajectories, plus entropy bonus
    return torch.mean(beta * entropy + torch.min(rewards * ratio, rewards * clip))
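A usage sketch of how such a function is typically called in a training loop. The environment helpers (envs, pong_utils.collect_trajectories) and the hyperparameters are assumptions based on the Udacity exercise [1], not part of the original post.

import torch.optim as optim

# assumptions: `envs`, `policy` and `pong_utils.collect_trajectories` come from
# the Udacity Pong exercise [1]; names and signatures here are illustrative
optimizer = optim.Adam(policy.parameters(), lr=1e-4)

for episode in range(500):
    # collect trajectories once with the current ("old") policy
    old_probs, states, actions, rewards = pong_utils.collect_trajectories(envs, policy, tmax=320)

    # re-use the same trajectories for a few proximal updates before collecting new ones
    for _ in range(4):
        L = -clipped_surrogate(policy, old_probs, states, actions, rewards,
                               epsilon=0.1, beta=0.01)
        optimizer.zero_grad()
        L.backward()
        optimizer.step()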

References

[1] Udacity Deep Reinforcement Learning Nanodegree: https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893

[2] 李宏毅 (Hung-yi Lee), Machine Learning 2020: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html

[3] Schulman et al., Proximal Policy Optimization Algorithms (PPO paper): https://arxiv.org/abs/1707.06347





- END -
