We will look at the situation where an agent decides what action to take sequentially over a finite time horizon to maximize revenue subject to resource consumption constraints. Every time the revenue generated will be some concave function of the action picked. Also, the revenue function and resource consumption will be sampled i.i.d. from some fixed distribution every time, and that distribution will be unknown to the agent. Balseiro, Hu, and Mirrokni (2020) introduced a general class of algorithms for this problem, which achieves sublinear regret. The algorithm maintains a dual multiplier at every time step and uses online mirror descent to update it. Depending on the regularizer chosen, the policy can be equivalent to dual sub-gradient descent and dual-multiplicative weight update. Moreover, this framework can achieve sublinear regret in repeated second-price auctions where, unlike the online allocation problem, bidders are unaware of their revenue functions and resource consumption (payment) before deciding their bid.
Our discussion will be based on the contents of the following paper:
"Dual Mirror Descent for Online Allocation Problems"-Santiago Balseiro, Haihao Lu, Vahab Mirrokni, In: Proceedings of the 37th International Conference on Machine Learning, PMLR 119:613-628, 2020