Indian Institute of Technology Hyderabad.
- In person @ A-201 and also via Zoom
(This talk is based on joint work with Prof. Rajeeva L. Karandikar.) Reinforcement Learning (RL) algorithms such as Temporal Difference Learning (TDL) and Q-learning update *just one component* of the value function (TDL) or the action-value function (Q-learning) at each time step. This is known as asynchronous stochastic approximation. There are two issues with this. First, many of the "convergence proofs" in the literature are not always correct. Second, when the dimension of the state space is very high, learning requires a huge number of time steps. In effect, spatial complexity is replaced by temporal complexity. A compromise is to update *some but not all* components of the value function, or of the action-value function, at each time step. This may be called Batch Asynchronous Stochastic Approximation (BASA). In this talk, I will present a very general framework for proving the convergence of BASA, which includes both TD learning and Q-learning as special cases and also leads to new algorithms with lower complexity.
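
To make the "some but not all components" idea concrete, here is a minimal Python sketch. It is not the framework from the talk. It is a toy illustration under stated assumptions: the target is the fixed point of a linear contraction `theta = A @ theta + b`, measurements are corrupted by Gaussian noise, the batch is drawn uniformly at random, and each component uses a `1/count` step size driven by its own update counter. The function name `batch_async_sa` and all parameter names are hypothetical.

```python
import random


def batch_async_sa(A, b, batch_size, steps, noise_std=0.1, seed=0):
    """Toy batch asynchronous stochastic approximation (illustrative only).

    Seeks theta with theta = A @ theta + b, assuming A is a contraction.
    batch_size = 1 gives the one-component-per-step (asynchronous) case,
    batch_size = len(b) updates every component (synchronous case).
    """
    rng = random.Random(seed)
    d = len(b)
    theta = [0.0] * d
    counts = [0] * d  # per-component update counters (assumed step-size rule)

    for _ in range(steps):
        # Pick which components to update at this time step.
        batch = rng.sample(range(d), batch_size)

        # Compute all noisy measurements from the current iterate first,
        # so that updates within a batch do not see each other.
        targets = {}
        for i in batch:
            value = sum(A[i][j] * theta[j] for j in range(d)) + b[i]
            targets[i] = value + rng.gauss(0.0, noise_std)

        # Move only the selected components toward their noisy targets.
        for i, y in targets.items():
            counts[i] += 1
            theta[i] += (y - theta[i]) / counts[i]

    return theta
```

Calling `batch_async_sa(A, b, batch_size=2, steps=20000)` with a 4-dimensional contraction updates two of the four components per step. Varying `batch_size` between 1 and `len(b)` trades work per step against the number of steps needed.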