Impact of Representation Learning in Linear Bandits
- URL: http://arxiv.org/abs/2010.06531v2
- Date: Wed, 5 May 2021 17:01:53 GMT
- Title: Impact of Representation Learning in Linear Bandits
- Authors: Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du
- Abstract summary: We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
- Score: 83.17684841392754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how representation learning can improve the efficiency of bandit
problems. We study the setting where we play $T$ linear bandits with dimension
$d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$
dimensional linear representation. For the finite-action setting, we present a
new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret,
where $N$ is the number of rounds we play for each bandit. When $T$ is
sufficiently large, our algorithm significantly outperforms the naive algorithm
(playing $T$ bandits independently) that achieves $\widetilde{O}(T\sqrt{d N})$
regret. We also provide an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower
bound, showing that our algorithm is minimax-optimal up to poly-logarithmic
factors. Furthermore, we extend our algorithm to the infinite-action setting
and obtain a corresponding regret bound which demonstrates the benefit of
representation learning in certain regimes. We also present experiments on
synthetic and real-world data to illustrate our theoretical findings and
demonstrate the effectiveness of our proposed algorithms.
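As a concrete illustration of the setting, the sketch below simulates $T$ linear bandit tasks whose parameters $\theta_t = B w_t$ share a common $d \times k$ representation $B$, and runs a naive per-task greedy ridge-regression player in place of the "play $T$ bandits independently" baseline. Everything here (the greedy rule, the noise level, all names) is an illustrative assumption, not the paper's algorithm.

```python
# Minimal sketch (assumptions, not the paper's method): T linear bandits of
# dimension d with parameters theta_t = B w_t sharing a d x k representation.
import numpy as np

rng = np.random.default_rng(0)
d, k, T, N, A = 20, 2, 10, 300, 15        # dim, rep. dim, tasks, rounds, actions

B, _ = np.linalg.qr(rng.standard_normal((d, k)))   # shared representation (d x k)
Theta = B @ rng.standard_normal((k, T))            # task parameters theta_t = B w_t

total_regret = 0.0
for t in range(T):                         # naive baseline: each task on its own
    V, b = np.eye(d), np.zeros(d)          # ridge-regression statistics
    for _ in range(N):
        actions = rng.standard_normal((A, d)) / np.sqrt(d)  # finite action set
        theta_hat = np.linalg.solve(V, b)                   # ridge estimate
        a = actions[np.argmax(actions @ theta_hat)]         # greedy choice
        r = Theta[:, t] @ a + 0.1 * rng.standard_normal()   # noisy linear reward
        total_regret += np.max(actions @ Theta[:, t]) - Theta[:, t] @ a
        V += np.outer(a, a)                                 # update statistics
        b += r * a

print(f"naive per-task regret over {T} tasks x {N} rounds: {total_regret:.1f}")
```

A multi-task method in the spirit of the abstract would instead pool data across tasks to estimate the shared $B$, which is what drives the improvement from the naive $\widetilde{O}(T\sqrt{dN})$ toward $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$.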
Related papers
- Context-lumpable stochastic bandits [49.024050919419366]
We consider a contextual bandit problem with $S$ contexts and $K$ actions.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
arXiv Detail & Related papers (2023-06-22T17:20:30Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k (\ll d)$ dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which match the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z)
- Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long as...
arXiv Detail & Related papers (2022-02-12T21:55:44Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Low-Rank Generalized Linear Bandit Problems [19.052901817304743]
In a low-rank linear bandit problem, the reward of an action is the inner product between the action and an unknown low-rank matrix $\Theta^*$.
We propose an algorithm based on a novel combination of the online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012) and the exponentially weighted average forecaster constructed by a covering of low-rank matrices; a toy sketch of this reward model follows the entry below.
arXiv Detail & Related papers (2020-06-04T15:30:14Z)
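For the low-rank model in the entry above, here is a tiny sketch of the reward structure only (an illustrative assumption, not the paper's algorithm): actions are matrices, and the expected reward is the trace inner product with a rank-$r$ matrix $\Theta^*$.

```python
# Sketch of the low-rank reward model <A, Theta*> (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 8, 8, 2
Theta_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))  # rank r

A = rng.standard_normal((d1, d2))              # a matrix-valued action
expected_reward = np.trace(A.T @ Theta_star)   # trace inner product <A, Theta*>
print(f"expected reward: {expected_reward:.3f}")
```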
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.