BEGIN:VCALENDAR
PRODID:-//eluceo/ical//2.0/EN
VERSION:2.0
CALSCALE:GREGORIAN
BEGIN:VEVENT
UID:www.tcs.tifr.res.in/event/1028
DTSTAMP:20230914T125947Z
SUMMARY:Regret Minimization and Rare Events in Stochastic Multi-Armed Bandi
ts
DESCRIPTION:Speaker: Anirban Bhattacharjee\n\nAbstract: In the stochast
 ic K-armed bandit problem\, an agent or learner is given a set of K pr
 obability distributions (informally\, `arms'). These distributions ar
 e known to come from some family of distributions with certain parame
 ters\, but the precise values of the parameters are unknown. At each p
 oint in time\, the agent is allowed to draw one arm (or more\, dependi
 ng on the problem)\, and the objective is to maximize the total expect
 ed reward over some given time horizon T. We measure how good a strat
 egy for drawing these arms is with respect to an oracle that knows a p
 riori which choices maximize the expected reward over time\, and defi
 ne (informally) regret as the difference in performance between the t
 wo. Lower bounds in the one-arm-at-a-time case were established by Bu
 rnetas and Katehakis (1996) and met (up to a constant) by the Upper C
 onfidence Bound (UCB) algorithm of Auer et al. (2002)\, when each of t
 he arms produces rewards within the interval [0\,1]. However\, we oft
 en come across situations where\, among the available choices\, the p
 robabilities of favourable rewards vary widely\, and the choices th
 at provide positive rewards with very low probability also yield ver
 y high rewards and may in fact have greater expected reward than a ch
 oice that is likelier to yield a positive reward (the latter rewar
 d\, of course\, must be much smaller than the former). A common examp
 le can be found in internet ad campaigns: an advertisement for regula
 r-use headphones (about INR 500 per pair) is much likelier to result i
 n the user making a purchase than an ad for a 200-year-old Italian vi
 olin for sale (about INR 150000)\, but the latter product is associat
 ed with a much higher profit for the selling portal. In such cases\, t
 he UCB algorithm can be seen to perform poorly in an asymptotic regim
 e where the probabilities of observing positive rewards tend to zer
 o. In this report\, we establish lower bounds and propose a modified v
 ersion of the UCB algorithm that is then proved to match the said low
 er bounds up to a constant\, under certain conditions.\nThe report en
 ds with a short survey of regret bounds in combinatorial multi-armed b
 andits and non-stationary multi-armed bandits\, two problems that hav
 e recently been the subject of attention in the bandit literature.\n
URL:https://www.tcs.tifr.res.in/web/events/1028
DTSTART;TZID=Asia/Kolkata:20191226T150000
DTEND;TZID=Asia/Kolkata:20191226T160000
LOCATION:A-201 (STCS Seminar Room)
END:VEVENT
END:VCALENDAR