Demystifying Approximate Reinforcement Learning Algorithms That Use Epsilon-Greedy Exploration

Aditya Gopalan
Sandeep K Juneja
Friday, 9 Dec 2022, 14:30 to 15:30
In reinforcement learning, value-function methods such as Q-learning and SARSA(0) with $\epsilon$-greedy exploration are among the state of the art, and their tabular (exact) forms converge to the optimal Q-function under reasonable conditions. With function approximation, however, these methods are known to exhibit erratic behaviors beyond the well-known instability of the iterates: policy oscillation and chattering, and convergence to different attractors (possibly even the worst policy) on different runs. Developing a theory that explains these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion theory to provide the first framework for resolving it. We further illustrate via numerical examples how this framework completely explains these algorithms' asymptotic behaviors. (Joint work with Gugan Thoppe, IISc)
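For concreteness, below is a minimal sketch of the kind of algorithm the talk studies: semi-gradient Q-learning with linear function approximation and $\epsilon$-greedy exploration. The environment interface (`env.reset`, `env.step`) and the feature map `phi` are illustrative assumptions, not part of the talk, and the code shows only the textbook update rule, not the speakers' differential-inclusion framework.

```python
import numpy as np

# Illustrative sketch: semi-gradient Q-learning with linear function
# approximation and epsilon-greedy exploration. The `env` interface and
# the feature map `phi` are assumed placeholders, not from the talk.

def epsilon_greedy(q_values, epsilon, rng):
    """Pick the greedy action with prob. 1 - epsilon, a uniform action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_learning_linear(env, phi, n_actions, n_features,
                      alpha=0.05, gamma=0.9, epsilon=0.1,
                      n_episodes=500, seed=0):
    """phi(s, a) -> feature vector of length n_features; Q(s, a) = theta . phi(s, a)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)  # linear weights
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            q_s = np.array([theta @ phi(s, a) for a in range(n_actions)])
            a = epsilon_greedy(q_s, epsilon, rng)
            s_next, r, done = env.step(a)  # assumed to return (state, reward, done)
            # Semi-gradient TD target: greedy (max) value at the next state.
            q_next = 0.0 if done else max(theta @ phi(s_next, b) for b in range(n_actions))
            delta = r + gamma * q_next - theta @ phi(s, a)  # TD error
            theta += alpha * delta * phi(s, a)              # semi-gradient update
            s = s_next
    return theta
```

Note that the $\epsilon$-greedy action selection is a discontinuous function of the weights `theta`, so the mean dynamics of this update are set-valued rather than a single ODE; this is where differential inclusions naturally enter the analysis.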