Regret Minimization and Rare Events in Stochastic Multi-Armed Bandits

Speaker:
Anirban Bhattacharjee
Date:
Thursday, 26 Dec 2019, 15:00 to 16:00
Venue:
A-201 (STCS Seminar Room)
Abstract
In the stochastic K-armed bandit problem, an agent (or learner) is given a set of K probability distributions, informally called "arms". The arms are known to belong to some parametric family of distributions, but the values of the parameters are unknown. At each point in time, the agent is allowed to draw one arm (or more, depending on the problem variant), and the objective is to maximize the total expected reward over a given time horizon T. The quality of a strategy for drawing arms is measured against an oracle that knows a priori which choices maximize the expected reward over time; informally, the regret is the difference in performance between the two. For the one-arm-at-a-time case, lower bounds were established by Burnetas and Katehakis (1996) and are met (up to a constant) by the Upper Confidence Bound (UCB) algorithm of Auer et al. (2002) when each arm produces rewards in the interval [0,1].

However, we often encounter situations in which the available choices vary widely in their probability of yielding a favourable reward: choices that produce positive rewards with very low probability may pay very large amounts when they do, and may in fact have greater expected reward than a choice that is far likelier to yield a positive reward (the latter reward, of course, must be much smaller than the former). A common example arises in internet ad campaigns: an advertisement for everyday headphones (about INR 500 a pair) is much likelier to result in the user making a purchase than an ad for a 200-year-old Italian violin for sale (about INR 150,000), but the latter product carries a much higher profit for the selling portal. In such cases, the UCB algorithm can be seen to perform poorly in an asymptotic regime where the probabilities of observing positive rewards tend to zero.

In this report, we establish lower bounds for this regime and propose a modified version of UCB, which is then shown to match these lower bounds up to a constant under certain conditions.
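For concreteness, the sketch below shows the standard UCB1 index rule of Auer et al. (2002) for rewards in [0,1], which the report builds on; it is not the modified algorithm proposed in the report, and the Bernoulli arm parameters in the small demo are purely illustrative.

import math
import random

def ucb1(draw_arm, k, horizon):
    """Minimal sketch of the UCB1 rule (Auer et al., 2002) for rewards in [0, 1].

    `draw_arm(i)` returns one stochastic reward from arm i.
    """
    counts = [0] * k        # number of times each arm has been drawn
    means = [0.0] * k       # empirical mean reward of each arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            i = t - 1       # initialisation: draw every arm once
        else:
            # pick the arm maximising empirical mean + sqrt(2 ln t / n_i)
            i = max(range(k),
                    key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))
        r = draw_arm(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
        total_reward += r
    return total_reward, counts

# Illustrative Bernoulli arms (hypothetical parameters): the second arm pays off
# rarely, mimicking the rare-event regime discussed in the abstract.
if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.30, 0.02]    # probability of a positive (unit) reward for each arm
    reward, pulls = ucb1(lambda i: float(rng.random() < probs[i]), len(probs), 10_000)
    print(reward, pulls)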
The report ends with a short survey of regret bounds for combinatorial multi-armed bandits and non-stationary multi-armed bandits, two problems that have recently received considerable attention in the bandit literature.