In this talk, we will revisit the classical regret-minimization problem in the multi-armed bandit setting. This problem has been well studied when the arm distributions are restricted to have either bounded support or belong to a single parameter exponential family (distributions characterized by 1 parameter, for example, Gaussian with fixed variance, Bernoulli distributions, Poisson distributions, etc.). However, in many applications, the underlying arm distributions fail to satisfy these simplifying assumptions. We will consider a much general class of arm distributions and work with a weaker assumption that the moments of order $(1 + \epsilon)$ are uniformly bounded by a known constant B, for some given $\epsilon > 0$. We will look at an optimal algorithm that matches the lower bound exactly in the in the first-order term.
Regret-minimization algorithms typically rely on constructing tight confidence intervals for means of the underlying distributions. We will also look at new anytime-valid confidence intervals for means, which are based on the empirical likelihood principle. Algorithms using the MGF-based concentration of the empirical mean for bounded support distributions (Auer et al., 2002), or of the robust estimators for mean, like the truncated empirical mean in the case of heavy-tailed distributions (Bubeck et al., 2013), exist in the literature. We will see exactly where the framework of the optimal algorithm gains over these existing algorithms that use MGF-based confidence intervals for the mean.
If time permits, we will also look at a mixture-martingale-based proof for the validity of the proposed confidence intervals.
This talk is based on joint work with Sandeep Juneja and Wouter M. Koolen (the paper can be found here). I will not assume any prerequisites beyond a basic understanding of probability theory.