In this new series of the Machine Learning Notebook, I start taking notes on Deep Reinforcement Learning, in English.
In this series, I will start with some of the 'recent' concepts in Deep Reinforcement Learning. That means we will not begin with the very basic or 'ancient' approaches to reinforcement learning, but with the exciting 'recent' approaches that catch many people's eyes. This does not mean the 'ancient' part is unimportant; if necessary, we will cover it gradually in future articles.
I choose to write in English mainly because this thriving domain of Machine Learning is currently getting much more attention in English-speaking countries, and many of the discussions and materials are in English. It is therefore worth getting used to discussing the related concepts in English.
Intro
In the previous blog, we summarized the idea of the REINFORCE algorithm, in which we update the policy weights using the gradient of the expected reward function.
However, this most basic policy gradient method has several problems. We will show what they are and which important tweaks address them.
Some of these tweaks lead us to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and they allow faster and more stable learning. PPO is now the default benchmark algorithm used by OpenAI, and many of its ideas are representative of how state-of-the-art DRL algorithms tackle their problems.
Let's first revisit the key ingredients of the REINFORCE algorithm (a toy sketch follows the three steps below):
First, initialize a random policy $\pi_\theta(a_t|s_t)$ with random weights $\theta$, and use it to collect a trajectory, i.e. a list of states, actions and rewards at each time-step.
Second, we compute the total reward of the trajectory, $R = r_1 + r_2 + \cdots + r_H$.
Third, we update our policy weights using gradient ascent with learning rate $\alpha$: $\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$, where the gradient is estimated from the collected trajectory as $g = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R$.
The process then repeats.
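To make these three steps concrete, here is a minimal toy sketch of REINFORCE in numpy. It assumes a hypothetical two-armed bandit with a softmax policy, so a 'trajectory' is just a single action and its reward; all the numbers are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                      # step 1: policy weights (here simply zero-initialized)
true_rewards = np.array([1.0, 2.0])      # hypothetical expected reward of each arm
alpha = 0.1                              # learning rate

def policy(theta):
    # softmax policy over the two actions
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(500):
    probs = policy(theta)
    a = rng.choice(2, p=probs)                    # run the stochastic policy
    R = true_rewards[a] + rng.normal(scale=0.1)   # step 2: total reward of the 'trajectory'
    # step 3: gradient ascent on log pi(a) * R;
    # for a softmax policy, grad_theta log pi(a) = onehot(a) - probs
    grad_log_pi = np.eye(2)[a] - probs
    theta = theta + alpha * grad_log_pi * R

print(policy(theta))   # most of the probability mass should now be on the better arm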
REINFORCE has three main problems:
The update process is very inefficient. We run the policy once, update once, and then throw the trajectory away.
The gradient estimate $g$ is very noisy. By chance, the single collected trajectory may not be representative of the policy.
There is no clear credit assignment. A trajectory may contain many good and bad actions, and whether these actions are reinforced depends only on the final total reward.
PPO proposes solutions to all of these problems, and we will cover them one by one below.
Noise Reduction
In the REINFORCE algorithm, we optimize the policy by maximizing the expected reward $U(\theta) = \sum_\tau P(\tau;\theta)\, R(\tau)$.
To evaluate this expectation exactly, we would need to average over millions (or infinitely many) trajectories, as the sum suggests; instead, we simply take one trajectory to compute the gradient and update our policy.
This shortcut makes our update come down to chance: sometimes the single collected trajectory simply does not contain useful information. The hope is that after training for a long time, the tiny signal accumulates.
The easiest way to reduce the noise in the gradient is to simply sample more trajectories. Using distributed computing, we can collect multiple trajectories in parallel and then estimate the policy gradient by averaging across all of them.
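As a quick illustration of why averaging helps (with made-up numbers rather than an actual policy gradient): if each trajectory gives a noisy estimate of the same underlying gradient, averaging a batch of them shrinks the noise by roughly the square root of the batch size.

import numpy as np

rng = np.random.default_rng(0)

# pretend each trajectory yields a noisy scalar gradient estimate with mean 1.0
per_trajectory = rng.normal(loc=1.0, scale=2.0, size=10_000)

# average over batches of 16 'parallel' trajectories
batched = per_trajectory.reshape(-1, 16).mean(axis=1)

print(per_trajectory.std())   # ~2.0
print(batched.std())          # ~0.5, i.e. roughly 2.0 / sqrt(16)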
Rewards Normalization
There is another bonus for running multiple trajectories: we can collect all the total rewards and get a sense of how they are distributed.
In many cases, the distribution of rewards shifts as learning happens. A reward of +1 might be really good in the beginning, but really bad after 1000 training episodes.
Learning can be improved if we normalize the rewards,
$$R_i \leftarrow \frac{R_i - \mu}{\sigma}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} R_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (R_i - \mu)^2},$$
where $\mu$ is the mean and $\sigma$ the standard deviation of the total rewards of the $N$ collected trajectories. (When all the rewards are equal, $\sigma = 0$, and we can simply set all the normalized rewards to 0.)
Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage and half to discourage, while also making sure the gradient-ascent steps are neither too large nor too small.
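A minimal sketch of this normalization step, using made-up total rewards from a batch of parallel trajectories:

import numpy as np

# hypothetical total rewards from 8 trajectories collected in parallel
R = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

mu = R.mean()
sigma = R.std() + 1e-10          # small epsilon in case all the rewards are equal
R_normalized = (R - mu) / sigma

print(R_normalized)              # roughly half positive (encouraged), half negative (discouraged)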
Credit Assignment
Going back to the gradient estimate, we can take a closer look at the total reward $R$, which is just the sum of the rewards collected at each step: $R = r_1 + r_2 + \cdots + r_H$.
Let's think about what happens at time-step $t$. Even before the action $a_t$ is decided, the agent has already received all the rewards up until step $t-1$; we can call this the reward from the past, $R^{\text{past}}_t = r_1 + \cdots + r_{t-1}$, and the rest the future reward, $R^{\text{future}}_t = r_t + r_{t+1} + \cdots + r_H$.
Because we have a Markov process, the action at time-step $t$ can only affect the future reward, so the past reward should not contribute to the policy gradient. To properly assign credit to the action $a_t$, we should therefore ignore the past reward and use only the future reward: $g = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R^{\text{future}}_t$.
Notes on Gradient Modification
It turns out that, mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn't change the averaged gradient. So even though the gradient is different during training, on average we are still maximizing the average reward. In fact, the resulting gradient is less noisy, so training with future rewards should speed things up. A small sketch of how the future rewards can be computed is given below.
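This sketch computes the (discounted) future reward at every time-step of one trajectory with a reversed cumulative sum; the per-step rewards and the discount are made-up numbers.

import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 0.0, 1.0])   # hypothetical per-step rewards
gamma = 0.99

# R_t^future = sum over k >= t of gamma^(k - t) * r_k
discounts = gamma ** np.arange(len(rewards))
future = (rewards * discounts)[::-1].cumsum()[::-1] / discounts

print(future)   # e.g. the first entry is gamma^2 * 1 + gamma^4 * 1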
Importance Sampling
In the REINFORCE algorithm, we start with a policy $\pi_\theta$; using that policy, we generate a trajectory (or several, to reduce noise), compute the policy gradient $g$, and update the weights, $\theta' \leftarrow \theta + \alpha g$.
At this point, the trajectories we've just generated are simply thrown away. If we want to update our policy again, we would need to generate new trajectories once more, using the updated policy.
This is because we need to compute the gradient for the current policy, and to do that the trajectories need to be representative of the current policy.
But what if we could somehow recycle the old trajectories? Then we could reuse them to compute gradients and update the policy, again and again.
However, using recycled trajectories (off-policy learning) has a problem: the old samples follow a distribution that is somewhat different from the trajectories the agent would encounter under the current policy. Hence we need some kind of sampling technique to ease the pain, and this is where importance sampling comes in. Consider the trajectories collected by the old policy $\pi_\theta$: each trajectory $\tau$ had a probability $P(\tau;\theta)$ of being sampled; under the new policy $\pi_{\theta'}$, the same trajectory would instead be sampled with probability $P(\tau;\theta')$.
If we want to compute the average of some quantity, say $f(\tau)$, under the new policy, we could simply generate trajectories from the new policy, compute $f(\tau)$, and average them. Mathematically, this is equivalent to adding up all the $f(\tau)$, weighted by the probability of sampling each trajectory under the new policy: $\mathbb{E}_{\tau \sim \pi_{\theta'}}[f(\tau)] = \sum_\tau P(\tau;\theta')\, f(\tau)$.
Now we could rearrange this equation, multiplying and dividing by the same number, $P(\tau;\theta)$, and regrouping the terms: $\sum_\tau P(\tau;\theta)\, \frac{P(\tau;\theta')}{P(\tau;\theta)}\, f(\tau)$.
Written in this way, we can reinterpret the first factor as the probability of sampling under the old policy, with an extra re-weighting factor $\frac{P(\tau;\theta')}{P(\tau;\theta)}$ applied in addition to just averaging.
(figure from [2])
Intuitively, this tells us we can use old trajectories to compute averages for a new policy, as long as we add this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one.
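Here is a tiny importance-sampling sketch with made-up numbers, unrelated to any real policy: we sample from an 'old' distribution but recover the expectation of f under a 'new' distribution by re-weighting each sample with the probability ratio.

import numpy as np

rng = np.random.default_rng(0)

p_old = np.array([0.5, 0.3, 0.2])   # distribution we actually sample from
p_new = np.array([0.2, 0.3, 0.5])   # distribution we want the average under
f = np.array([1.0, 2.0, 3.0])       # some quantity f(a) of the sampled outcome

a = rng.choice(3, size=100_000, p=p_old)

plain = f[a].mean()                               # estimates E_old[f]  (about 1.7)
reweighted = (p_new[a] / p_old[a] * f[a]).mean()  # estimates E_new[f]  (about 2.3)

print(plain, reweighted, (p_new * f).sum())       # the re-weighted estimate matches E_new[f]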
The re-weighting factor
Let's take a closer look at the re-weighting factor:
$$\frac{P(\tau;\theta')}{P(\tau;\theta)} = \frac{\pi_{\theta'}(a_1|s_1)\,\pi_{\theta'}(a_2|s_2)\,\pi_{\theta'}(a_3|s_3)\cdots}{\pi_{\theta}(a_1|s_1)\,\pi_{\theta}(a_2|s_2)\,\pi_{\theta}(a_3|s_3)\cdots}$$
Each trajectory probability is a chain of products of the policy evaluated at every time-step, as the equation above shows.
If we estimate this factor over a long trajectory, the product of many individual ratios can easily drift far away from 1. For instance, when some of the policy probabilities get quite close to 0, the re-weighting factor can itself become close to zero, or worse, diverge to infinity. This makes the re-weighting trick unreliable, as the toy example below illustrates.
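A short illustration with arbitrary numbers: even if every single-step ratio is only slightly off from 1, the product over a long trajectory drifts far from 1, and a near-zero old-policy probability anywhere makes it blow up.

import numpy as np

step_ratios = np.full(100, 1.05)        # each step's pi_new / pi_old is only 5% off...
print(step_ratios.prod())               # ...but the product over 100 steps is already ~131

step_ratios[50] = 0.9 / 1e-6            # one near-zero old-policy probability in the chain
print(step_ratios.prod())               # the re-weighting factor explodes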
To cope with this problem, PPO introduces a trick: the surrogate function.
The Surrogate Function (Proximal Policy)
Let's take a look at our policy gradient again. Re-writing the derivative of the log term as $\nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t) = \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$, and including the re-weighting factor, we have:
$$g = \frac{P(\tau;\theta')}{P(\tau;\theta)} \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}\, R_t^{\text{future}}$$
Then, by re-arranging these equations, we replace the trajectory-probability ratio with the chain of per-step policy ratios; the factor $\pi_{\theta'}(a_t|s_t)$ in that chain cancels against the denominator of the log-derivative term, leaving the product over all the other time-steps, which we denote by "...":
$$g = \sum_t \frac{(\dots)}{(\dots)}\; \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}$$
Now the idea of the proximal policy comes in: we assume that the old and the current policies are close enough to each other, so that all the factors hidden in the "..." are very close to 1 and can be dropped, leaving only the terms shown below:
$$g \approx \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}$$
In the end, we approximate the re-weighting factor by the ratio of the outputs of the two policies for the same state-action pair at each time-step.
We can think of this approximated gradient as the gradient of a new object, called the surrogate function:
$$L_{\text{sur}}(\theta', \theta) = \sum_t \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}, \qquad g \approx \nabla_{\theta'} L_{\text{sur}}(\theta', \theta)$$
Therefore, we aim to maximize this surrogate function by gradient ascent, as long as the new policy stays in the proximity of the old one.
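A small PyTorch sketch, with arbitrary probabilities and rewards, showing that differentiating the surrogate with respect to the new-policy probabilities indeed yields the re-weighted gradient discussed above:

import torch

old_probs = torch.tensor([0.6, 0.5, 0.7])                       # pi_theta(a_t|s_t), fixed
new_probs = torch.tensor([0.6, 0.5, 0.7], requires_grad=True)   # pi_theta'(a_t|s_t), starts equal
future_rewards = torch.tensor([1.0, -0.5, 2.0])

L_sur = torch.mean(new_probs / old_probs * future_rewards)      # the surrogate function
L_sur.backward()

print(new_probs.grad)   # equals future_rewards / old_probs / 3, i.e. the re-weighted gradient terms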
There remains another important issue: if the distributions of the two policies differ too much, the assumption we made above is no longer valid. Hence there must be some constraint to prevent that from happening.
One solution is to introduce the KL divergence as a regularization term.
KL Divergence as Regularization (TRPO and PPO)
In the original papers of TRPO (the predecessor of PPO) and PPO, despite all the complex mathematical derivations, the main idea is to add the KL divergence between the two policies as a constraint on the optimization.
In PPO (the KL-penalty variant), the optimization objective looks like this:
$$\max_{\theta'}\; \mathbb{E}_t\!\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, \hat{A}_t\right] - \beta\, \mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta}(\cdot|s_t) \,\|\, \pi_{\theta'}(\cdot|s_t)\right)\right]$$
Here the KL divergence enters as a differentiable regularization term (with coefficient $\beta$), which makes the optimization process a bit easier.
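A hedged sketch of this penalized objective for a single state with three discrete actions; the probabilities, advantage, and beta below are made-up numbers, and the real algorithm would average over many sampled time-steps and adapt beta.

import torch

beta = 0.5
pi_old = torch.tensor([0.5, 0.3, 0.2])                  # old policy at this state, fixed
logits = torch.zeros(3, requires_grad=True)             # parameters of the new policy
pi_new = torch.softmax(logits, dim=0)

action, advantage = 0, 1.5                              # sampled action and its (made-up) advantage

ratio = pi_new[action] / pi_old[action]
kl = torch.sum(pi_old * torch.log(pi_old / pi_new))     # KL(pi_old || pi_new)

objective = ratio * advantage - beta * kl               # we want to maximize this
(-objective).backward()                                 # i.e. gradient descent on the negative
print(logits.grad)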
(figure from [2])
TRPO, in contrast, treats the KL divergence as a hard constraint on the optimization, which is mathematically more precise but much harder to optimize.
In practice, researchers have found that PPO achieves results very similar to TRPO, while being much easier to implement.
Clipping the Policy Update (PPO-2)
In PPO-2, we replace the KL-divergence regularization with a clipped policy update.
When the two policy distributions differ too much, it is highly likely that the (old) policy from which you sampled your trajectories will push the currently learned policy off a 'cliff' in the reward landscape. In such a case, it could be impossible to climb out of the resulting bad-policy plateau.
Hence, an intuitive way to deal with this cliff-jumping effect is to put restrictions on the surrogate function itself.
The formula of the clipped surrogate function is
$$L^{\text{clip}}_{\text{sur}}(\theta', \theta) = \sum_t \min\!\left(\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}},\; \operatorname{clip}\!\left(\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\right) R_t^{\text{future}}\right)$$
We want to make sure the two policies do not differ too much: once the probability ratio leaves the interval $[1-\epsilon,\, 1+\epsilon]$, the surrogate flattens out, so there is no extra incentive (and no gradient) for pushing the new policy even further away from the old one.
(figure from [2])
The whole clipped surrogate function could be implemented as follows (the code follows the Udacity pong exercise [1]):
import numpy as np
import torch

# pong_utils comes from the Udacity pong exercise [1]
import pong_utils
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def clipped_surrogate(policy, old_probs, states, actions, rewards,
                      discount=0.995, epsilon=0.1, beta=0.01):
    # discounted future rewards (credit assignment)
    discounts = discount ** np.arange(len(rewards))
    Reward = np.asarray(rewards) * discounts[:, np.newaxis]
    Reward_future = Reward[::-1].cumsum(axis=0)[::-1]

    # normalize the rewards across the parallel trajectories (noise reduction)
    R_mean = Reward_future.mean(axis=1)
    R_std = Reward_future.std(axis=1) + 1e-10
    reward_normalized = (Reward_future - R_mean[:, np.newaxis]) / R_std[:, np.newaxis]

    # convert everything into pytorch tensors on the right device
    actions = torch.tensor(actions, dtype=torch.int8, device=device)
    old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
    rewards = torch.tensor(reward_normalized, dtype=torch.float, device=device)

    # convert states to policy (or probability)
    new_probs = pong_utils.states_to_prob(policy, states)
    new_probs = torch.where(actions == pong_utils.RIGHT, new_probs, 1.0 - new_probs)

    # clipped probability ratio (the importance-sampling re-weighting factor)
    ratio = new_probs / old_probs
    clip = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

    # include a regularization term:
    # this steers new_policy towards 0.5, which prevents the policy from
    # becoming exactly 0 or 1 and helps exploration;
    # add in 1.e-10 to avoid log(0) which gives nan
    entropy = -(new_probs * torch.log(old_probs + 1.e-10) +
                (1.0 - new_probs) * torch.log(1.0 - old_probs + 1.e-10))

    return torch.mean(beta * entropy + torch.min(rewards * ratio, rewards * clip))
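For completeness, here is a rough sketch of how the function might sit inside a training loop in the Udacity pong exercise [1]; envs, policy, and pong_utils.collect_trajectories are assumed to be provided by that exercise, and the hyper-parameters are illustrative only.

import torch.optim as optim

optimizer = optim.Adam(policy.parameters(), lr=1e-4)

for episode in range(500):
    # collect trajectories with the current (soon-to-be-old) policy
    old_probs, states, actions, rewards = pong_utils.collect_trajectories(envs, policy, tmax=320)

    # reuse the same batch of (old-policy) trajectories for several gradient steps
    for _ in range(4):
        L = -clipped_surrogate(policy, old_probs, states, actions, rewards)
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
        del L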
References
- [1] Udacity DRLND course: https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893
- [2] Hung-yi Lee (李宏毅), Machine Learning 2020: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html
- [3] PPO paper: https://arxiv.org/abs/1707.06347
- END -