Finite Continuum-Armed Bandits
- URL: http://arxiv.org/abs/2010.12236v2
- Date: Tue, 3 Nov 2020 14:37:47 GMT
- Title: Finite Continuum-Armed Bandits
- Authors: Solenne Gaucher (LMO)
- Abstract summary: We consider a situation where an agent has $T$ resources to be allocated to a larger number of actions.
The goal of the agent is to maximize her cumulative reward.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a situation where an agent has $T$ resources to be allocated to
a larger number $N$ of actions. Each action can be completed at most once and
results in a stochastic reward with unknown mean. The goal of the agent is to
maximize her cumulative reward. Non-trivial strategies are possible when side
information on the actions is available, for example in the form of covariates.
Focusing on a nonparametric setting, where the mean reward is an unknown
function of a one-dimensional covariate, we propose an optimal strategy for
this problem. Under natural assumptions on the reward function, we prove that
the optimal regret scales as $O(T^{1/3})$ up to poly-logarithmic factors when
the budget $T$ is proportional to the number of actions $N$. When $T$ becomes
small compared to $N$, a smooth transition occurs. When the ratio $T/N$
decreases from a constant to $N^{-1/3}$, the regret increases progressively up
to the $O(T^{1/2})$ rate encountered in continuum-armed bandits.
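To make the setting concrete, here is a minimal simulation sketch (not the paper's algorithm): $N$ actions carry covariates in $[0,1]$, a hypothetical mean-reward function, and a budget of $T$ one-shot pulls; an ad-hoc explore-then-commit strategy with roughly $T^{1/3}$ covariate bins samples a few actions per bin and spends the remaining budget on the most promising bins. All numerical choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: N actions with covariates x_i in [0, 1], unknown
# mean-reward function f, and a budget of T pulls; each action used at most once.
N, T = 3000, 600
f = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)   # assumed smooth reward function
x = rng.uniform(0.0, 1.0, size=N)                  # covariates (side information)

def pull(i):
    """Spend one unit of budget on action i: Bernoulli reward with mean f(x_i)."""
    return rng.binomial(1, f(x[i]))

# Ad-hoc explore-then-commit over M ~ T^(1/3) covariate bins, in the spirit of
# binning arguments for continuum-armed bandits (tuning constants are guesses).
M = int(np.ceil(T ** (1 / 3)))
bins = np.minimum((x * M).astype(int), M - 1)
pool = [list(rng.permutation(np.where(bins == b)[0])) for b in range(M)]

explore_per_bin = max(1, T // (3 * M))
bin_mean = np.full(M, -np.inf)
reward, spent = 0.0, 0

for b in range(M):                                  # exploration phase
    draws = [pull(pool[b].pop()) for _ in range(min(explore_per_bin, len(pool[b])))]
    if draws:
        bin_mean[b] = np.mean(draws)
        reward += sum(draws)
        spent += len(draws)

for b in np.argsort(-bin_mean):                     # commit phase: best bins first
    while pool[b] and spent < T:
        reward += pull(pool[b].pop())
        spent += 1

print(f"budget spent: {spent}/{T}, cumulative reward: {reward:.0f}")
```

In this sketch the number of bins and the per-bin exploration budget are chosen only to illustrate the $T^{1/3}$-bin heuristic; the paper's strategy and its tuning are more refined.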
Related papers
- Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S\sqrt{AK})$ where $\bar{C}$ is the cost budget for an episode.
arXiv Detail & Related papers (2024-10-14T04:51:06Z) - Variance-aware robust reinforcement learning with linear function approximation with heavy-tailed rewards [6.932056534450556]
We present two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards.
AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}(\cdot)$.
VARA achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^* K})$.
arXiv Detail & Related papers (2023-03-09T22:16:28Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption, and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers a sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$.
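As a quick illustration of the kind of reward process described in this entry (a generic AR($k$) sketch, not the paper's exact model), the snippet below lets each round's reward depend on an action-specific intercept plus a weighted sum of the $k$ previous rewards; the coefficients, noise level, and uniform-random policy are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder AR(k) reward dynamics: the chosen action shifts the intercept,
# and the k most recent rewards feed back into the next one.
k, T, n_actions = 2, 500, 3
gamma = np.array([0.5, 0.3])            # assumed AR coefficients (sum < 1 for stability)
intercept = np.array([0.2, 0.5, 0.8])   # assumed per-action contribution

history = [0.0] * k                      # last k rewards, oldest first
rewards = []
for t in range(T):
    a = rng.integers(n_actions)          # placeholder policy: uniform random action
    recent = np.array(history[-k:][::-1])            # most recent reward first
    r = intercept[a] + gamma @ recent + 0.05 * rng.standard_normal()
    rewards.append(r)
    history.append(r)

print(f"average reward under random play: {np.mean(rewards):.3f}")
```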
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits [11.918230810566945]
We study the non-stationary multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning.
We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde{O}(\sqrt{K N(S+1)})$ dynamic regret.
arXiv Detail & Related papers (2022-01-17T17:23:56Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
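For readers unfamiliar with the sampling rule mentioned in this entry, here is a generic sketch of Inverse Gap Weighting over a finite action set (the classical rule, not this paper's submodular variant): actions whose estimated reward lags the greedy choice receive probability inversely proportional to that gap. The reward estimates and the parameter gamma below are illustrative.

```python
import numpy as np

def inverse_gap_weighting(f_hat, gamma):
    """Classical Inverse Gap Weighting: a non-greedy action a gets probability
    1 / (A + gamma * (f_hat[best] - f_hat[a])); the greedy action absorbs the rest."""
    f_hat = np.asarray(f_hat, dtype=float)
    A = len(f_hat)
    best = int(np.argmax(f_hat))
    p = 1.0 / (A + gamma * (f_hat[best] - f_hat))   # inverse-gap weights
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                          # remaining mass on the greedy action
    return p

# Example with illustrative reward estimates for four actions.
probs = inverse_gap_weighting([0.9, 0.7, 0.5, 0.2], gamma=10.0)
print(probs, probs.sum())                            # a valid distribution favoring action 0
```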
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDP, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z) - Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_{\star} K})$ and $\widetilde{O}(\sqrt{D T_{\star} S A K})$ for the full-information setting and the bandit feedback setting, respectively.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)