In this new series of the Machine Learning Notebook, I start taking notes on Deep Reinforcement Learning, in English.
In this series, I will start with some of the 'recent' concepts in Deep Reinforcement Learning. That means we will not begin with the very basic or 'ancient' approaches to reinforcement learning, but with the exciting 'recent' approaches that catch many people's eyes. This does not mean the 'ancient' part is unimportant; if necessary, we will cover it gradually in future articles.
I choose to write in English mainly because this thriving domain of Machine Learning is currently getting much more attention in English-speaking countries, and many of the discussions and materials are in English. It is therefore worth getting used to discussing the related concepts in English.
Intro
In the previous blog, we summarized the idea of the REINFORCE algorithm, in which we update the policy weights using the gradient of the expected reward function.
However, this most basic policy gradient method has several problems. We will show what they are and which important tweaks address them.
Some of these tweaks lead us to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and they allow faster and more stable learning. PPO is now the default benchmark algorithm used by OpenAI, and many of its ideas are representative of how state-of-the-art DRL algorithms tackle their problems.
Let's first revisit the key ingredients of the REINFORCE algorithm (a toy sketch follows the three steps below):
First, initialize a random policy $\pi_\theta(a_t|s_t)$ with random weights $\theta$, and use it to collect a trajectory, i.e. a list of states, actions and rewards at each time-step.
Second, we compute the total reward of the trajectory, $R = r_1 + r_2 + \cdots + r_H$.
Third, we update our policy weights using gradient ascent with learning rate $\alpha$: $\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$, where the gradient is estimated from the collected trajectory as $g = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R$.
The process then repeats.
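To make these three steps concrete, here is a minimal toy sketch of REINFORCE in numpy. It assumes a hypothetical two-armed bandit with a softmax policy, so a 'trajectory' is just a single action and its reward; all the numbers are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros(2)                      # step 1: policy weights (here simply zero-initialized)
true_rewards = np.array([1.0, 2.0])      # hypothetical expected reward of each arm
alpha = 0.1                              # learning rate

def policy(theta):
    # softmax policy over the two actions
    e = np.exp(theta - theta.max())
    return e / e.sum()

for episode in range(500):
    probs = policy(theta)
    a = rng.choice(2, p=probs)                    # run the stochastic policy
    R = true_rewards[a] + rng.normal(scale=0.1)   # step 2: total reward of the 'trajectory'
    # step 3: gradient ascent on log pi(a) * R;
    # for a softmax policy, grad_theta log pi(a) = onehot(a) - probs
    grad_log_pi = np.eye(2)[a] - probs
    theta = theta + alpha * grad_log_pi * R

print(policy(theta))   # most of the probability mass should now be on the better arm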
REINFORCE has three main problems:
The update process is very inefficient. We run the policy once, update once, and then throw the trajectory away.
The gradient estimate $g$ is very noisy. By chance, the single collected trajectory may not be representative of the policy.
There is no clear credit assignment. A trajectory may contain many good and bad actions, and whether these actions are reinforced depends only on the final total reward.
PPO proposes solutions to all of these problems, and we will cover them one by one below.
Noise Reduction
In the REINFORCE algorithm, we optimize the policy by maximizing the expected reward $U(\theta) = \sum_\tau P(\tau;\theta)\, R(\tau)$.
To evaluate this expectation exactly, we would need to average over millions (or infinitely many) trajectories, as the sum suggests; instead, we simply take one trajectory to compute the gradient and update our policy.
This shortcut makes our update come down to chance: sometimes the single collected trajectory simply does not contain useful information. The hope is that after training for a long time, the tiny signal accumulates.
The easiest way to reduce the noise in the gradient is to simply sample more trajectories. Using distributed computing, we can collect multiple trajectories in parallel and then estimate the policy gradient by averaging across all of them.
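As a quick illustration of why averaging helps (with made-up numbers rather than an actual policy gradient): if each trajectory gives a noisy estimate of the same underlying gradient, averaging a batch of them shrinks the noise by roughly the square root of the batch size.

import numpy as np

rng = np.random.default_rng(0)

# pretend each trajectory yields a noisy scalar gradient estimate with mean 1.0
per_trajectory = rng.normal(loc=1.0, scale=2.0, size=10_000)

# average over batches of 16 'parallel' trajectories
batched = per_trajectory.reshape(-1, 16).mean(axis=1)

print(per_trajectory.std())   # ~2.0
print(batched.std())          # ~0.5, i.e. roughly 2.0 / sqrt(16)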
Rewards Normalization
There is another bonus for running multiple trajectories: we can collect all the total rewards and get a sense of how they are distributed.
In many cases, the distribution of rewards shifts as learning happens. A reward of +1 might be really good in the beginning, but really bad after 1000 training episodes.
Learning can be improved if we normalize the rewards,
$$R_i \leftarrow \frac{R_i - \mu}{\sigma}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} R_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (R_i - \mu)^2},$$
where $\mu$ is the mean and $\sigma$ the standard deviation of the total rewards of the $N$ collected trajectories. (When all the rewards are equal, $\sigma = 0$, and we can simply set all the normalized rewards to 0.)
Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage and half to discourage, while also making sure the gradient-ascent steps are neither too large nor too small.
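A minimal sketch of this normalization step, using made-up total rewards from a batch of parallel trajectories:

import numpy as np

# hypothetical total rewards from 8 trajectories collected in parallel
R = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

mu = R.mean()
sigma = R.std() + 1e-10          # small epsilon in case all the rewards are equal
R_normalized = (R - mu) / sigma

print(R_normalized)              # roughly half positive (encouraged), half negative (discouraged)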
Credit Assignment
Going back to the gradient estimate, we can take a closer look at the total reward $R$, which is just the sum of the rewards collected at each step: $R = r_1 + r_2 + \cdots + r_H$.
Let's think about what happens at time-step $t$. Even before the action $a_t$ is decided, the agent has already received all the rewards up until step $t-1$; we can call this the reward from the past, $R^{\text{past}}_t = r_1 + \cdots + r_{t-1}$, and the rest the future reward, $R^{\text{future}}_t = r_t + r_{t+1} + \cdots + r_H$.
Because we have a Markov process, the action at time-step $t$ can only affect the future reward, so the past reward should not contribute to the policy gradient. To properly assign credit to the action $a_t$, we should therefore ignore the past reward and use only the future reward: $g = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R^{\text{future}}_t$.
Notes on Gradient Modification
It turns out that, mathematically, ignoring past rewards might change the gradient for each specific trajectory, but it doesn't change the averaged gradient. So even though the gradient is different during training, on average we are still maximizing the average reward. In fact, the resulting gradient is less noisy, so training with future rewards should speed things up. A small sketch of how the future rewards can be computed is given below.
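This sketch computes the (discounted) future reward at every time-step of one trajectory with a reversed cumulative sum; the per-step rewards and the discount are made-up numbers.

import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 0.0, 1.0])   # hypothetical per-step rewards
gamma = 0.99

# R_t^future = sum over k >= t of gamma^(k - t) * r_k
discounts = gamma ** np.arange(len(rewards))
future = (rewards * discounts)[::-1].cumsum()[::-1] / discounts

print(future)   # e.g. the first entry is gamma^2 * 1 + gamma^4 * 1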
Importance Sampling
In the REINFORCE algorithm, we start with a policy $\pi_\theta$; using that policy, we generate a trajectory (or several, to reduce noise), compute the policy gradient $g$, and update the weights, $\theta' \leftarrow \theta + \alpha g$.
At this point, the trajectories we've just generated are simply thrown away. If we want to update our policy again, we would need to generate new trajectories once more, using the updated policy.
This is because we need to compute the gradient for the current policy, and to do that the trajectories need to be representative of the current policy.
But what if we could somehow recycle the old trajectories? Then we could reuse them to compute gradients and update the policy, again and again.
However, using recycled trajectories (off-policy learning) has a problem: the old samples follow a distribution that is somewhat different from the trajectories the agent would encounter under the current policy. Hence we need some kind of sampling technique to ease the pain, and this is where importance sampling comes in. Consider the trajectories collected by the old policy $\pi_\theta$: each trajectory $\tau$ had a probability $P(\tau;\theta)$ of being sampled; under the new policy $\pi_{\theta'}$, the same trajectory would instead be sampled with probability $P(\tau;\theta')$.
If we want to compute the average of some quantity, say $f(\tau)$, under the new policy, we could simply generate trajectories from the new policy, compute $f(\tau)$, and average them. Mathematically, this is equivalent to adding up all the $f(\tau)$, weighted by the probability of sampling each trajectory under the new policy: $\mathbb{E}_{\tau \sim \pi_{\theta'}}[f(\tau)] = \sum_\tau P(\tau;\theta')\, f(\tau)$.
Now we could rearrange this equation, multiplying and dividing by the same number, $P(\tau;\theta)$, and regrouping the terms: $\sum_\tau P(\tau;\theta)\, \frac{P(\tau;\theta')}{P(\tau;\theta)}\, f(\tau)$.
Written in this way, we can reinterpret the first factor as the probability of sampling under the old policy, with an extra re-weighting factor $\frac{P(\tau;\theta')}{P(\tau;\theta)}$ applied in addition to just averaging.
(figure from [2])
Intuitively, this tells us we can use old trajectories to compute averages for a new policy, as long as we add this extra re-weighting factor, which takes into account how under- or over-represented each trajectory is under the new policy compared to the old one.
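Here is a tiny importance-sampling sketch with made-up numbers, unrelated to any real policy: we sample from an 'old' distribution but recover the expectation of f under a 'new' distribution by re-weighting each sample with the probability ratio.

import numpy as np

rng = np.random.default_rng(0)

p_old = np.array([0.5, 0.3, 0.2])   # distribution we actually sample from
p_new = np.array([0.2, 0.3, 0.5])   # distribution we want the average under
f = np.array([1.0, 2.0, 3.0])       # some quantity f(a) of the sampled outcome

a = rng.choice(3, size=100_000, p=p_old)

plain = f[a].mean()                               # estimates E_old[f]  (about 1.7)
reweighted = (p_new[a] / p_old[a] * f[a]).mean()  # estimates E_new[f]  (about 2.3)

print(plain, reweighted, (p_new * f).sum())       # the re-weighted estimate matches E_new[f]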
The re-weighting factor
Let's take a closer look at the re-weighting factor:
$$\frac{P(\tau;\theta')}{P(\tau;\theta)} = \frac{\pi_{\theta'}(a_1|s_1)\,\pi_{\theta'}(a_2|s_2)\,\pi_{\theta'}(a_3|s_3)\cdots}{\pi_{\theta}(a_1|s_1)\,\pi_{\theta}(a_2|s_2)\,\pi_{\theta}(a_3|s_3)\cdots}$$
Each trajectory probability is a chain of products of the policy evaluated at every time-step, as the equation above shows.
If we estimate this factor over a long trajectory, the product of many individual ratios can easily drift far away from 1. For instance, when some of the policy probabilities get quite close to 0, the re-weighting factor can itself become close to zero, or worse, diverge to infinity. This makes the re-weighting trick unreliable, as the toy example below illustrates.
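A short illustration with arbitrary numbers: even if every single-step ratio is only slightly off from 1, the product over a long trajectory drifts far from 1, and a near-zero old-policy probability anywhere makes it blow up.

import numpy as np

step_ratios = np.full(100, 1.05)        # each step's pi_new / pi_old is only 5% off...
print(step_ratios.prod())               # ...but the product over 100 steps is already ~131

step_ratios[50] = 0.9 / 1e-6            # one near-zero old-policy probability in the chain
print(step_ratios.prod())               # the re-weighting factor explodes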
To cope with this problem, PPO introduces a trick: the surrogate function.
The Surrogate Function (Proximal Policy)
Let's take a look at our policy gradient again. Re-writing the derivative of the log term as $\nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t) = \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$, and including the re-weighting factor, we have:
$$g = \frac{P(\tau;\theta')}{P(\tau;\theta)} \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}\, R_t^{\text{future}}$$
Then, by re-arranging these equations, we replace the trajectory-probability ratio with the chain of per-step policy ratios; the factor $\pi_{\theta'}(a_t|s_t)$ in that chain cancels against the denominator of the log-derivative term, leaving the product over all the other time-steps, which we denote by "...":
$$g = \sum_t \frac{(\dots)}{(\dots)}\; \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}$$
Now the idea of the proximal policy comes in: we assume that the old and the current policies are close enough to each other, so that all the factors hidden in the "..." are very close to 1 and can be dropped, leaving only the terms shown below:
$$g \approx \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}$$
In the end, we approximate the re-weighting factor by the ratio of the outputs of the two policies for the same state-action pair at each time-step.
We can think of this approximated gradient as the gradient of a new object, called the surrogate function:
$$L_{\text{sur}}(\theta', \theta) = \sum_t \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}}, \qquad g \approx \nabla_{\theta'} L_{\text{sur}}(\theta', \theta)$$
Therefore, we aim to maximize this surrogate function by gradient ascent, as long as the new policy stays in the proximity of the old one.
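A small PyTorch sketch, with arbitrary probabilities and rewards, showing that differentiating the surrogate with respect to the new-policy probabilities indeed yields the re-weighted gradient discussed above:

import torch

old_probs = torch.tensor([0.6, 0.5, 0.7])                       # pi_theta(a_t|s_t), fixed
new_probs = torch.tensor([0.6, 0.5, 0.7], requires_grad=True)   # pi_theta'(a_t|s_t), starts equal
future_rewards = torch.tensor([1.0, -0.5, 2.0])

L_sur = torch.mean(new_probs / old_probs * future_rewards)      # the surrogate function
L_sur.backward()

print(new_probs.grad)   # equals future_rewards / old_probs / 3, i.e. the re-weighted gradient terms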
There remains another important issue: if the distributions of the two policies differ too much, the assumption we made above is no longer valid. Hence there must be some constraint to prevent that from happening.
One solution is to introduce the KL divergence as a regularization term.
KL Divergence as Regularization (TRPO and PPO)
In the original papers of TRPO (the predecessor of PPO) and PPO, despite all the complex mathematical derivations, the main idea is to add the KL divergence between the two policies as a constraint on the optimization.
In PPO (the KL-penalty variant), the optimization objective looks like this:
$$\max_{\theta'}\; \mathbb{E}_t\!\left[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, \hat{A}_t\right] - \beta\, \mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta}(\cdot|s_t) \,\|\, \pi_{\theta'}(\cdot|s_t)\right)\right]$$
Here the KL divergence enters as a differentiable regularization term (with coefficient $\beta$), which makes the optimization process a bit easier.
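A hedged sketch of this penalized objective for a single state with three discrete actions; the probabilities, advantage, and beta below are made-up numbers, and the real algorithm would average over many sampled time-steps and adapt beta.

import torch

beta = 0.5
pi_old = torch.tensor([0.5, 0.3, 0.2])                  # old policy at this state, fixed
logits = torch.zeros(3, requires_grad=True)             # parameters of the new policy
pi_new = torch.softmax(logits, dim=0)

action, advantage = 0, 1.5                              # sampled action and its (made-up) advantage

ratio = pi_new[action] / pi_old[action]
kl = torch.sum(pi_old * torch.log(pi_old / pi_new))     # KL(pi_old || pi_new)

objective = ratio * advantage - beta * kl               # we want to maximize this
(-objective).backward()                                 # i.e. gradient descent on the negative
print(logits.grad)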
(figure from [2])
TRPO, in contrast, treats the KL divergence as a hard constraint on the optimization, which is mathematically more precise but much harder to optimize.
In practice, researchers have found that PPO achieves results very similar to TRPO, while being much easier to implement.
Clipping the Policy Update (PPO-2)
In PPO-2, we replace the KL-divergence regularization with a clipped policy update.
When the two policy distributions differ too much, it is highly likely that the (old) policy from which you sampled your trajectories will push the currently learned policy off a 'cliff' in the reward landscape. In such a case, it could be impossible to climb out of the resulting bad-policy plateau.
Hence, an intuitive way to deal with this cliff-jumping effect is to put restrictions on the surrogate function itself.
The formula of the clipped surrogate function is
$$L^{\text{clip}}_{\text{sur}}(\theta', \theta) = \sum_t \min\!\left(\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\, R_t^{\text{future}},\; \operatorname{clip}\!\left(\frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\right) R_t^{\text{future}}\right)$$
We want to make sure the two policies do not differ too much: once the probability ratio leaves the interval $[1-\epsilon,\, 1+\epsilon]$, the surrogate flattens out, so there is no extra incentive (and no gradient) for pushing the new policy even further away from the old one.
(figure from [2])
The whole clipped surrogate function could be implemented as follows (the code follows the Udacity pong exercise [1]):
import numpy as np
import torch

# pong_utils comes from the Udacity pong exercise [1]
import pong_utils
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def clipped_surrogate(policy, old_probs, states, actions, rewards,
                      discount=0.995, epsilon=0.1, beta=0.01):
    # discounted future rewards (credit assignment)
    discounts = discount ** np.arange(len(rewards))
    Reward = np.asarray(rewards) * discounts[:, np.newaxis]
    Reward_future = Reward[::-1].cumsum(axis=0)[::-1]

    # normalize the rewards across the parallel trajectories (noise reduction)
    R_mean = Reward_future.mean(axis=1)
    R_std = Reward_future.std(axis=1) + 1e-10
    reward_normalized = (Reward_future - R_mean[:, np.newaxis]) / R_std[:, np.newaxis]

    # convert everything into pytorch tensors on the right device
    actions = torch.tensor(actions, dtype=torch.int8, device=device)
    old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
    rewards = torch.tensor(reward_normalized, dtype=torch.float, device=device)

    # convert states to policy (or probability)
    new_probs = pong_utils.states_to_prob(policy, states)
    new_probs = torch.where(actions == pong_utils.RIGHT, new_probs, 1.0 - new_probs)

    # clipped probability ratio (the importance-sampling re-weighting factor)
    ratio = new_probs / old_probs
    clip = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

    # include a regularization term:
    # this steers new_policy towards 0.5, which prevents the policy from
    # becoming exactly 0 or 1 and helps exploration;
    # add in 1.e-10 to avoid log(0) which gives nan
    entropy = -(new_probs * torch.log(old_probs + 1.e-10) +
                (1.0 - new_probs) * torch.log(1.0 - old_probs + 1.e-10))

    return torch.mean(beta * entropy + torch.min(rewards * ratio, rewards * clip))
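For completeness, here is a rough sketch of how the function might sit inside a training loop in the Udacity pong exercise [1]; envs, policy, and pong_utils.collect_trajectories are assumed to be provided by that exercise, and the hyper-parameters are illustrative only.

import torch.optim as optim

optimizer = optim.Adam(policy.parameters(), lr=1e-4)

for episode in range(500):
    # collect trajectories with the current (soon-to-be-old) policy
    old_probs, states, actions, rewards = pong_utils.collect_trajectories(envs, policy, tmax=320)

    # reuse the same batch of (old-policy) trajectories for several gradient steps
    for _ in range(4):
        L = -clipped_surrogate(policy, old_probs, states, actions, rewards)
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
        del L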
References
- [1] Udacity DRLND course: https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893
- [2] Hung-yi Lee (李宏毅), Machine Learning 2020: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html
- [3] PPO paper: https://arxiv.org/abs/1707.06347
- END -