Stochastic Contextual Bandits with Long Horizon Rewards
- URL: http://arxiv.org/abs/2302.00814v2
- Date: Fri, 3 Feb 2023 20:01:20 GMT
- Title: Stochastic Contextual Bandits with Long Horizon Rewards
- Authors: Yuzhen Qin, Yingcong Li, Fabio Pasqualetti, Maryam Fazel, Samet Oymak
- Abstract summary: This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions.
We propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly.
Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$.
- Score: 31.981315673799806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing interest in complex decision-making and language modeling
problems highlights the importance of sample-efficient learning over very long
horizons. This work takes a step in this direction by investigating contextual
linear bandits where the current reward depends on at most $s$ prior actions
and contexts (not necessarily consecutive), up to a time horizon of $h$. In
order to avoid polynomial dependence on $h$, we propose new algorithms that
leverage sparsity to discover the dependence pattern and arm parameters
jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$)
regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{
q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$,
total time horizon $T$, and $q$ that is adaptive to the reward dependence
pattern. Complementing upper bounds, we also show that learning over a single
trajectory brings inherent challenges: While the dependence pattern and arm
parameters form a rank-1 matrix, circulant matrices are not isometric over
rank-1 manifolds and sample complexity indeed benefits from the sparse reward
dependence structure. Our results necessitate a new analysis to address
long-range temporal dependencies across data and avoid polynomial dependence on
the reward horizon $h$. Specifically, we utilize connections to the restricted
isometry property of circulant matrices formed by dependent sub-Gaussian
vectors and establish new guarantees that are also of independent interest.
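To make the reward model concrete, the following is a minimal simulation sketch under illustrative assumptions (it is not the algorithm proposed in the paper): the reward at round $t$ depends on $s$ unknown, not necessarily consecutive, lags within a window of length $h$ through a shared arm parameter, so stacking the $h$ most recent contexts per round gives a circulant-style design whose combined parameter is $sd$-sparse and rank-1 as an $h \times d$ matrix. The dimensions, noise level, and the use of scikit-learn's Lasso below are assumptions for illustration.

```python
# Hypothetical illustration only (not the paper's algorithm): simulate rewards that
# depend on s of the last h chosen contexts through a shared arm parameter, then
# recover the dependence pattern and parameter jointly with an off-the-shelf Lasso.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, h, s, T = 5, 10, 2, 2000                      # illustrative sizes

theta = rng.normal(size=d) / np.sqrt(d)          # unknown arm parameter
support = rng.choice(h, size=s, replace=False)   # unknown sparse set of active lags
w = np.zeros(h)
w[support] = 1.0                                 # lag indicators (rank-1 with theta)

X = rng.normal(size=(T, d)) / np.sqrt(d)         # contexts of the pulled arms
y = np.zeros(T)
for t in range(T):
    for j in range(min(h, t + 1)):               # reward mixes sparsely chosen past rounds
        y[t] += w[j] * (X[t - j] @ theta)
y += 0.01 * rng.normal(size=T)                   # sub-Gaussian observation noise

# Stack the h most recent contexts per round: shifted copies of X form a
# circulant-style design, and the combined parameter (w outer theta) is s*d-sparse.
Z = np.zeros((T, h * d))
for t in range(T):
    for j in range(min(h, t + 1)):
        Z[t, j * d:(j + 1) * d] = X[t - j]

est = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000).fit(Z, y).coef_
lags = np.argsort(-np.linalg.norm(est.reshape(h, d), axis=1))[:s]
print("true active lags:", sorted(support.tolist()))
print("recovered lags:  ", sorted(lags.tolist()))
```

With enough rounds, the largest-norm rows of the recovered $h \times d$ matrix indicate the active lags, illustrating how sparsity lets the dependence pattern and arm parameter be estimated jointly from a single trajectory.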
Related papers
- Sparse Linear Bandits with Blocking Constraints [22.01704171400845]
We investigate the high-dimensional sparse linear bandits problem in a data-poor regime. We show novel offline statistical guarantees of the lasso estimator for the linear model. We propose a meta-algorithm based on corralling that does not need knowledge of the optimal sparsity parameter $k$ at minimal cost to regret.
arXiv Detail & Related papers (2024-10-26T01:42:03Z) - Restless Linear Bandits [5.00389879175348]
It is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t, t \in \mathbb{N})$ which gives rise to pay-offs.
An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate.
arXiv Detail & Related papers (2024-05-17T14:37:39Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Online Learning and Bandits with Queried Hints [28.270453093780382]
We consider the classic online learning and multi-armed bandit (MAB) problems.
We derive algorithms whose regret bounds have exponentially better dependence on the time horizon.
We show that probing with $k=2$ suffices to achieve time-independent regret bounds for online linear and convex optimization.
arXiv Detail & Related papers (2022-11-04T18:41:08Z) - Exploration in Linear Bandits with Rich Action Sets and its Implications
for Inference [23.364534479492715]
We show that the minimum eigenvalue of the expected matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$.
We apply our result to two practical scenarios -- model selection and clustering in linear bandits.
arXiv Detail & Related papers (2022-07-23T20:25:07Z) - Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z) - Leverage Score Sampling for Tensor Product Matrices in Input Sparsity
Time [54.65688986250061]
We give an input sparsity time sampling algorithm for approximating the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices.
Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time.
arXiv Detail & Related papers (2022-02-09T15:26:03Z) - Settling the Horizon-Dependence of Sample Complexity in Reinforcement
Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions.
We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z) - On the Sample Complexity of Privately Learning Axis-Aligned Rectangles [16.092248433189816]
We revisit the fundamental problem of learning Axis-Aligned-Rectangles over a finite grid.
We present a novel algorithm that reduces the sample complexity to only $\tilde{O}\left(d\cdot\left(\log^{*}|X|\right)^{1.5}\right)$.
arXiv Detail & Related papers (2021-07-24T04:06:11Z) - Joint Online Learning and Decision-making via Dual Mirror Descent [20.560099149143245]
We consider an online revenue problem over a finite time horizon subject to lower and upper bounds on cost.
We propose a novel offline benchmark that mixes an online dual mirror descent scheme with a generic parameter learning process.
When the parameter is not known and must be learned, we demonstrate that the regret and constraint violations are the sums of the previous $O(\sqrt{T})$ terms plus terms that directly depend on the convergence of the learning process.
arXiv Detail & Related papers (2021-04-20T04:02:07Z) - Doubly robust Thompson sampling for linear payoffs [12.375561840897742]
We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling.
The proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound with the order of $\tilde{O}(\phi^{-2}\sqrt{T})$.
arXiv Detail & Related papers (2021-02-01T23:31:10Z) - High-Dimensional Sparse Linear Bandits [67.9378546011416]
We derive a novel $\Omega(n^{2/3})$ dimension-free minimax regret lower bound for sparse linear bandits in the data-poor regime.
We also prove a dimension-free $O(\sqrt{n})$ regret upper bound under an additional assumption on the magnitude of the signal for relevant features.
arXiv Detail & Related papers (2020-11-08T16:48:11Z) - Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
We show that learning $k\times k$, rank-$r$ matrices to normalized $L_1$ distance requires $\Omega(\frac{kr}{\epsilon^2})$ samples.
We propose an algorithm that uses ${\cal O}(\frac{kr}{\epsilon^2}\log^2\frac{r}{\epsilon})$ samples, a number linear in the high dimension and nearly linear in the matrices' typically low rank.
arXiv Detail & Related papers (2020-09-30T19:10:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.