Online Learning with Vector Costs and Bandits with Knapsacks
- URL: http://arxiv.org/abs/2010.07346v1
- Date: Wed, 14 Oct 2020 18:28:41 GMT
- Title: Online Learning with Vector Costs and Bandits with Knapsacks
- Authors: Thomas Kesselheim and Sahil Singla
- Abstract summary: We introduce online learning with vector costs (OLVC$_p$) where in each time step $t \in \{1,\ldots, T\}$, we need to play an action that incurs an unknown vector cost.
The goal of the online algorithm is to minimize the $\ell_p$ norm of the sum of its cost vectors.
- Score: 8.340947016086739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce online learning with vector costs (OLVC$_p$) where in each time
step $t \in \{1,\ldots, T\}$, we need to play an action $i \in \{1,\ldots,n\}$
that incurs an unknown vector cost in $[0,1]^{d}$. The goal of the online
algorithm is to minimize the $\ell_p$ norm of the sum of its cost vectors. This
captures the classical online learning setting for $d=1$, and is interesting
for general $d$ because of applications like online scheduling where we want to
balance the load between different machines (dimensions).
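As a concrete illustration of the objective (a minimal sketch, not code from the paper; the function and variable names below are illustrative), the quantity being minimized is the $\ell_p$ norm of the coordinate-wise sum of the incurred cost vectors:

```python
# Minimal sketch of the OLVC_p objective (illustrative names, not from the paper).
import numpy as np

def olvc_objective(cost_vectors: np.ndarray, p: float) -> float:
    """cost_vectors has shape (T, d), each row in [0,1]^d; returns ||sum_t c_t||_p."""
    total = cost_vectors.sum(axis=0)        # load accumulated on each of the d dimensions
    return float(np.linalg.norm(total, ord=p))

# d = 1 recovers the classical online learning objective (total scalar cost);
# large p approaches the makespan, i.e. balancing load across machines.
rng = np.random.default_rng(0)
costs = rng.uniform(size=(100, 4))          # T = 100 rounds, d = 4 dimensions
print(olvc_objective(costs, p=2.0))
```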
We study OLVC$_p$ in both stochastic and adversarial arrival settings, and give
a general procedure to reduce the problem from $d$ dimensions to a single
dimension. This allows us to use classical online learning algorithms in both
full and bandit feedback models to obtain (near) optimal results. In
particular, we obtain a single algorithm (up to the choice of learning rate)
that gives sublinear regret for stochastic arrivals and a tight $O(\min\{p,
\log d\})$ competitive ratio for adversarial arrivals.
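To make the reduction idea concrete, here is a hedged sketch (an illustration under my own assumptions, not the paper's exact procedure): each action's vector cost is scalarized by its marginal effect on an $\ell_p$ potential of the accumulated load, and the resulting one-dimensional costs are fed to standard Hedge (multiplicative weights) under full feedback.

```python
# Illustrative sketch only: scalarize vector costs via an l_p potential of the
# accumulated load and run standard full-feedback Hedge on the scalar costs.
# The specific scalarization and all names are assumptions, not the paper's algorithm.
import numpy as np

def hedge_on_scalarized_costs(cost_tensor: np.ndarray, p: float, eta: float, seed: int = 0):
    """cost_tensor has shape (T, n, d) with entries in [0,1]; returns the played actions."""
    T, n, d = cost_tensor.shape
    rng = np.random.default_rng(seed)
    weights = np.ones(n)
    load = np.zeros(d)                                  # accumulated cost per dimension
    actions = []
    for t in range(T):
        probs = weights / weights.sum()
        i = rng.choice(n, p=probs)                      # sample an action from Hedge's distribution
        actions.append(i)
        load += cost_tensor[t, i]
        # One-dimensional surrogate cost: how much each action would increase the
        # l_p^p potential of the current load (gradient of sum_j load_j^p).
        grad = p * np.maximum(load, 1e-12) ** (p - 1)
        scalar_costs = cost_tensor[t] @ grad
        scalar_costs /= max(scalar_costs.max(), 1e-12)  # rescale to [0,1] for the Hedge update
        weights *= np.exp(-eta * scalar_costs)          # full-feedback multiplicative-weights update
    return actions
```

In a bandit-feedback variant one would replace the Hedge update with an importance-weighted estimator in the style of EXP3, while the scalarization step stays the same.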
The OLVC$_p$ problem also occurs as a natural subproblem when trying to solve
the popular Bandits with Knapsacks (BwK) problem. This connection allows us to
use our OLVC$_p$ techniques to obtain (near) optimal results for BwK in both
stochastic and adversarial settings. In particular, we obtain a tight $O(\log d
\cdot \log T)$ competitive ratio algorithm for adversarial BwK, which improves
over the $O(d \cdot \log T)$ competitive ratio algorithm of Immorlica et al.
[FOCS'19].
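To make the BwK connection concrete, the following is a minimal sketch of the Bandits-with-Knapsacks protocol itself (the environment and policy interfaces are illustrative assumptions, and nothing here is the paper's algorithm): each pull yields a reward and a $d$-dimensional resource consumption, and play stops once some resource budget would be exhausted.

```python
# Minimal sketch of the BwK protocol (illustrative interfaces, not the paper's algorithm).
import numpy as np

def run_bwk(pull, policy, d: int, budget: float, horizon: int) -> float:
    """pull(i) -> (reward, consumption in [0,1]^d); policy(history) -> arm index."""
    spent = np.zeros(d)
    total_reward = 0.0
    history = []
    for _ in range(horizon):
        i = policy(history)
        reward, consumption = pull(i)
        if np.any(spent + consumption > budget):        # stop once any knapsack would overflow
            break
        spent += consumption
        total_reward += reward
        history.append((i, reward, consumption))
    return total_reward
```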
Related papers
- Online Learning of Halfspaces with Massart Noise [47.71073318490341]
We study the task of online learning in the presence of Massart noise.
We present a computationally efficient algorithm that achieves mistake bound $\eta T + o(T)$.
We use our Massart online learner to design an efficient bandit algorithm whose expected reward is at least $(1-1/k)\Delta T - o(T)$ larger than that of choosing a random action at every round.
arXiv Detail & Related papers (2024-05-21T17:31:10Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Contextual Recommendations and Low-Regret Cutting-Plane Algorithms [49.91214213074933]
We consider the following variant of contextual linear bandits motivated by routing applications in navigational engines and recommendation systems.
We design novel cutting-plane algorithms with low "regret" -- the total distance between the true point $w^*$ and the hyperplanes the separation oracle returns.
arXiv Detail & Related papers (2021-06-09T05:39:05Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z) - Contextual Bandits with Side-Observations [10.248045133793287]
We investigate contextual bandits in the presence of side-observations across arms in order to design recommendation algorithms for users connected via social networks.
We show that a naive application of existing learning algorithms results in $O(N \log T)$ regret, where $N$ is the number of users.
arXiv Detail & Related papers (2020-06-06T19:34:50Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.