Howard's Policy Iteration (HPI) is a classic algorithm for solving Markov Decision Problems (MDPs). HPI uses a "greedy" switching rule to update from any non-optimal policy to a dominating one, iterating until an optimal policy is found. Although HPI has existed for over 60 years, and is remarkably efficient in practice, theoretical analysis has struggled to explain its efficiency. On MDPs wth n states and 2 actions per state, the number of policies is 2^{n}; this number is a trivial upper bound on the number of iterations taken by HPI. Interestingly, the tightest upper bounds known for the same quantity is O(2^{n}/n), a mere linear-factor improvement. In contrast, the tightest known lower bound is Omega(n). In this talk, I will present two recent results, which establish tighter upper bounds on the running time of HPI on restricted classes of MDPs. On MDPs whose transitions are all deterministic (DMDPs), we show an upper bound of poly(n) \phi^{n} iterations, where \phi is the golden ratio 1.618.... This result marks an exponential improvement over Mansour and Singh's bound of O(2^{n}/n) for general MDPs. Further, in DMDPs with rewards of constant bit-size, we furnish a subexponential-in-n upper bound. Our results emerge from contrasting lines of anlaysis: the first from a count of cycles in a family of digraphs, and the second from a root-separation bound for univariate polynomials.
Short Bio : Shivaram Kalyanakrishnan is an Associate Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology Bombay. His research interests include artificial intelligence and machine learning, spanning topics such as sequential decision making, multi-agent learning, multi-armed bandits, and humanoid robotics. Kalyanakrishnan received a Ph.D. in computer science from the University of Texas at Austin. Subsequently he was a Research Scientist at Yahoo Labs Bangalore and an INSPIRE Faculty Fellow at the Indian Institute of Science, Bangalore. Kalyanakrishnan works on both theoretical and applied problems. He heads e-Yantra, a large-scale outreach programme that annually trains thousands of students in robotics.