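Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. It differs from supervised learning in that the learner receives only partial feedback about its predictions. Moreover, the predictions may have long-term effects because they influence the future state of the controlled system, so time plays a special role. The goal of reinforcement learning is to develop efficient learning algorithms and to understand their merits and limitations. Reinforcement learning has a wide range of practical applications, from artificial intelligence to operations research and control engineering. In this book, we focus on those reinforcement learning algorithms that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, cover a large number of state-of-the-art algorithms, and then discuss their theoretical properties and limitations.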
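To make "a numerical performance measure that expresses a long-term objective" concrete: one common choice, and the one suggested by the book's appendix on discounted Markovian decision processes, is the discounted sum of rewards. The sketch below is only an illustration and is not taken from the book; the function name `discounted_return` and the parameter name `gamma` are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence.

    `rewards` is a list of per-step rewards and `gamma` is a discount
    factor in [0, 1); both names are hypothetical, chosen for illustration.
    """
    total = 0.0
    # Work backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With a discount factor below 1, a reward received later contributes less than the same reward received earlier, which is one way the special role of time shows up in the objective.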
Preface
Acknowledgments
Markov Decision Processes
  Preliminaries
  Markov Decision Processes
  Value functions
  Dynamic programming algorithms for solving MDPs
Value Prediction Problems
  Temporal difference learning in finite state spaces
    Tabular TD(0)
    Every-visit Monte-Carlo
    TD(lambda): Unifying Monte-Carlo and TD(0)
  Algorithms for large state spaces
    TD(lambda) with function approximation
    Gradient temporal difference learning
    Least-squares methods
    The choice of the function space
Control
  A catalog of learning problems
  Closed-loop interactive learning
    Online learning in bandits
    Active learning in bandits
    Active learning in Markov Decision Processes
    Online learning in Markov Decision Processes
  Direct methods
    Q-learning in finite MDPs
    Q-learning with function approximation
  Actor-critic methods
    Implementing a critic
    Implementing an actor
For Further Exploration
  Further reading
  Applications
  Software
Appendix: The Theory of Discounted Markovian Decision Processes
  A.1 Contractions and Banach’s fixed-point theorem
  A.2 Application to MDPs
Bibliography
Author's Biography
https://sites.ualberta.ca/~szepesva/rlbook.html