Abstract: In the multi-armed bandit problem, a gambler must choose one of K arms at each round, where each arm is characterized by an unknown reward distribution. The objective is to maximize the cumulative expected reward over a time horizon T, with performance measured by the regret with respect to a static oracle that knows the best arm a priori.
This problem has been studied extensively when the reward distributions do not change over time. Here, however, we consider the case in which the reward distributions change over time, subject to a fixed budget on the total variation. The talk will present lower bounds on the regret with respect to a dynamic oracle that knows the best arm in each round, and will examine two algorithms known to be effective in this setting, both achieving regret within a logarithmic factor of the lower bound.
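As a concrete illustration of the setting (not part of the talk itself), the following sketch simulates a two-armed Bernoulli bandit whose means swap once mid-horizon, a simple instance of bounded total variation, and measures the dynamic regret of a policy against the oracle that knows the best arm in each round. The instance, the parameter values, and the epsilon-greedy policy are all illustrative choices; epsilon-greedy stands in for a generic stationary-bandit policy and is not one of the algorithms discussed in the talk.

```python
import random

def drifting_means(t, T):
    # Hypothetical non-stationary instance: two Bernoulli arms whose
    # means swap halfway through the horizon, so the total change
    # over the horizon is bounded (a small "variation budget").
    return (0.9, 0.1) if t < T // 2 else (0.1, 0.9)

def dynamic_regret(T=10_000, eps=0.1, seed=0):
    """Run epsilon-greedy and return its regret against the dynamic
    oracle, which plays the arm with the highest mean at every round."""
    rng = random.Random(seed)
    counts = [0, 0]
    values = [0.0, 0.0]          # running mean reward per arm
    oracle_reward = 0.0
    player_reward = 0.0
    for t in range(T):
        means = drifting_means(t, T)
        oracle_reward += max(means)          # dynamic oracle's expected reward
        if rng.random() < eps:
            arm = rng.randrange(2)           # explore uniformly
        else:
            arm = 0 if values[0] >= values[1] else 1  # exploit current estimate
        reward = 1.0 if rng.random() < means[arm] else 0.0
        player_reward += reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return oracle_reward - player_reward
```

Because epsilon-greedy averages rewards over the whole history, it adapts slowly after the means swap, which is exactly why the dynamic oracle is a much harder benchmark than the static one.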