Autoregressive Bandits
- URL: http://arxiv.org/abs/2212.06251v2
- Date: Mon, 19 Feb 2024 19:01:36 GMT
- Title: Autoregressive Bandits
- Authors: Francesco Bacchiocchi, Gianmarco Genalti, Davide Maran, Marco Mussi,
Marcello Restelli, Nicola Gatti, Alberto Maria Metelli
- Abstract summary: We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $\Gamma < 1$ is a stability index of the process.
- Score: 58.46584210388307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive processes naturally arise in a large variety of real-world
scenarios, including stock markets, sales forecasting, weather prediction,
advertising, and pricing. When facing a sequential decision-making problem in
such a context, the temporal dependence between consecutive observations should
be properly accounted for to guarantee convergence to the optimal policy. In
this work, we propose a novel online learning setting, namely, Autoregressive
Bandits (ARBs), in which the observed reward is governed by an autoregressive
process of order $k$, whose parameters depend on the chosen action. We show
that, under mild assumptions on the reward process, the optimal policy can be
conveniently computed. Then, we devise a new optimistic regret minimization
algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers
sublinear regret of order $\widetilde{\mathcal{O}} \left(
\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $T$ is the
optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a
stability index of the process. Finally, we empirically validate our algorithm,
illustrating its advantages w.r.t. bandit baselines and its robustness to
misspecification of key parameters.
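
To make the setting concrete, below is a minimal, self-contained sketch (not the authors' implementation) of an ARB interaction: the reward follows an AR(k) process whose coefficients depend on the chosen action, and the learner selects actions optimistically from per-action least-squares estimates. The coefficient values, noise level, ridge penalty `lam`, bonus scale `beta`, and the one-step optimistic rule are simplifying assumptions made here for illustration; the paper's AR-UCB constructs its estimates and confidence bounds differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Environment: action-dependent AR(k) reward process (illustrative numbers) ---
k = 2                      # order of the autoregressive process
n_actions = 3
T = 5000

# gamma[a] = [gamma_0(a), gamma_1(a), ..., gamma_k(a)]; each action's AR
# coefficients sum to less than one, in the spirit of the stability index Gamma < 1.
gamma = np.array([
    [0.30, 0.40, 0.20],
    [0.50, 0.30, 0.10],
    [0.10, 0.60, 0.25],
])
noise_std = 0.1

def step(action, history):
    """Reward X_t = gamma_0(a) + sum_i gamma_i(a) * X_{t-i} + noise."""
    feats = np.concatenate(([1.0], history))          # [1, X_{t-1}, ..., X_{t-k}]
    return gamma[action] @ feats + noise_std * rng.normal()

# --- Learner: per-action ridge regression + optimistic bonus (a simplification) ---
lam, beta = 1.0, 1.0                                  # ridge penalty and bonus scale (assumed)
A = [lam * np.eye(k + 1) for _ in range(n_actions)]   # regularized Gram matrices
b = [np.zeros(k + 1) for _ in range(n_actions)]

history = np.zeros(k)                                 # last k rewards, most recent first
total_reward = 0.0
for t in range(T):
    feats = np.concatenate(([1.0], history))
    ucb = np.empty(n_actions)
    for a in range(n_actions):
        theta_hat = np.linalg.solve(A[a], b[a])       # estimated AR coefficients for action a
        bonus = beta * np.sqrt(feats @ np.linalg.solve(A[a], feats))
        ucb[a] = theta_hat @ feats + bonus            # optimistic one-step reward prediction
    action = int(np.argmax(ucb))

    reward = step(action, history)
    total_reward += reward

    A[action] += np.outer(feats, feats)               # update only the played action's estimator
    b[action] += reward * feats
    history = np.concatenate(([reward], history[:-1]))

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

In this toy, stability is enforced by keeping each action's AR coefficients summing to less than one; as the stability index $\Gamma$ approaches one, rewards mix more slowly, which is consistent with the $(1-\Gamma)^{-2}$ factor in the regret bound above.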
Related papers
- Online Resource Allocation in Episodic Markov Decision Processes [1.8416014644193066]
We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process.
We propose the observe-then-decide regime and improve the existing decide-then-observe regime.
We show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde{O}(\rho^{-1} H^{3/2} S \sqrt{AT})$ with high probability.
arXiv Detail & Related papers (2023-05-18T06:28:34Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Variational Bayesian Optimistic Sampling [43.130465039780084]
We consider online sequential decision problems where an agent must balance exploration and exploitation.
We derive a set of 'Bayesian optimistic' policies which, in the multi-armed bandit case, includes the Thompson sampling policy.
We show that any algorithm producing policies in the optimistic set enjoys $\tilde{O}(\sqrt{AT})$ Bayesian regret for a problem with $A$ actions after $T$ rounds.
arXiv Detail & Related papers (2021-10-29T11:28:29Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Reward Biased Maximum Likelihood Estimation for Reinforcement Learning [13.820705458648233]
The Reward-Biased Maximum Likelihood Estimate (RBMLE) was originally proposed for the adaptive control of Markov chains.
We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-11-16T06:09:56Z) - Nearly Dimension-Independent Sparse Linear Bandit over Small Action
Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model.
This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine.
We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called Optimistic Pessimistic Linear Bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting (see the primal-dual sketch below).
arXiv Detail & Related papers (2020-03-01T17:47:03Z)