Tata Institute of Fundamental Research

Teaching your computer to play chess (Part 2): Bandits, MCTS, and Approximate Policy Iteration

STCS Student Seminar
Speaker: Aakash Ghosh (TIFR)
Organiser: Soumyajit Pyne
Date: Friday, 8 May 2026, 16:00 to 17:00
Venue: A-201 (STCS Seminar Room)

Abstract: 

Building on the theoretical limitations of pure TD-learning explored in Part 1, how do modern neural engines like AlphaZero actually master the game? In the second part of this series, we shift our focus to local search, framed as a sequential decision-making problem under uncertainty. We will introduce the Multi-Armed Bandit problem and the UCB1 algorithm, then extend these ideas to game trees via Monte Carlo Tree Search (MCTS). Finally, we will deconstruct the AlphaZero framework, showing how it uses MCTS as a formal policy improvement operator to perform Approximate Policy Iteration and thereby sidestep the "Deadly Triad" of deep reinforcement learning.
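As a small preview of the bandit machinery the talk builds on, here is a minimal sketch of UCB1 (the standard rule: pull each arm once, then pick the arm maximizing empirical mean plus sqrt(2 ln t / n_i)). All names and the toy Bernoulli setup below are illustrative, not taken from the talk itself:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch: pull(i) returns a reward in [0, 1] for arm i.
    Returns the per-arm pull counts after `horizon` rounds."""
    counts = [0] * n_arms          # n_i: times arm i was pulled
    sums = [0.0] * n_arms          # cumulative reward of arm i
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # initialization: try every arm once
        else:
            # UCB1 index: empirical mean + exploration bonus
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Toy usage: three Bernoulli arms with success probabilities 0.2, 0.5, 0.8.
random.seed(0)
means = [0.2, 0.5, 0.8]
counts = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0,
              n_arms=3, horizon=2000)
# The pull counts concentrate on the best arm (index 2),
# while suboptimal arms are pulled only logarithmically often.
```

MCTS reuses exactly this index at every node of the game tree (in a variant called UCT), which is the bridge from bandits to tree search that the talk develops.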