Bandits with Heavy Tails: Algorithms, Analysis and Optimality

Shubhada Agrawal
Sandeep K Juneja
Friday, 9 Dec 2022, 11:30 to 12:30
Multi-armed bandit (MAB) is a popular framework for sequential decision-making in an uncertain environment. In the classical setup of MAB, the algorithm has access to a fixed and finite set K of unknown, independent probability distributions or arms. At each time step, having observed the outcomes of all the previous actions, the algorithm chooses one of the K arms and receives an independent sample drawn from the underlying distribution, which may be considered a reward. The algorithm's goal is either to maximize the accumulated rewards or to identify the best arm in as few samples as possible for an appropriate notion of best.
Variants of these classical formulations have been widely studied in the literature. Tight lower bounds and optimal algorithms have been developed under the assumption that the arm distributions are either Sub-Gaussian or come from a single parameter exponential family (SPEF), for example, Gaussian distributions with a known variance or Bernoulli distributions. However, in practice, these distributional assumptions may not hold. Developing lower bounds and optimal algorithms for the general distributions largely remained open, mainly because of the need for new tools for the analysis in this generality.
In this dissertation, we undertake a detailed study of the MAB problems allowing for all the distributions with a known uniform bound on their (1+\epsilon)^{th} moments for some $\epsilon > 0$. This class subsumes a large class of heavy-tailed distributions. We develop a framework with essential tools and concentration inequalities and use it to design optimal algorithms for three key variants of the MAB problem, including the classical frameworks of regret minimization and best-arm identification.
A key component of designing an optimal algorithm for MAB is constructing tight, anytime valid confidence intervals (CIs) for mean. We develop new concentration inequalities to this end, which may be of independent interest.
The above results were obtained in collaborations involving Sandeep Juneja, Wouter M. Koolen and Peter Glynn.

Zoom link: