Revisiting Weighted Strategy for Non-stationary Parametric Bandits
- URL: http://arxiv.org/abs/2303.02691v2
- Date: Wed, 7 Jun 2023 06:44:28 GMT
- Title: Revisiting Weighted Strategy for Non-stationary Parametric Bandits
- Authors: Jing Wang, Peng Zhao, Zhi-Hua Zhou
- Abstract summary: This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
- Score: 82.1942459195896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-stationary parametric bandits have attracted much attention recently.
There are three principled ways to deal with non-stationarity, including
sliding-window, weighted, and restart strategies. As many non-stationary
environments exhibit gradual drifting patterns, the weighted strategy is
commonly adopted in real-world applications. However, previous theoretical
studies show that its analysis is more involved and the algorithms are either
computationally less efficient or statistically suboptimal. This paper revisits
the weighted strategy for non-stationary parametric bandits. In linear bandits
(LB), we discover that this undesirable feature is due to an inadequate regret
analysis, which results in an overly complex algorithm design. We propose a
refined analysis framework, which simplifies the derivation and importantly
produces a simpler weight-based algorithm that is as efficient as
window/restart-based algorithms while retaining the same regret as previous
studies. Furthermore, our new framework can be used to improve regret bounds of
other parametric bandits, including Generalized Linear Bandits (GLB) and
Self-Concordant Bandits (SCB). For example, we develop a simple weighted GLB
algorithm with an $\widetilde{O}(k_\mu^{\frac{5}{4}} c_\mu^{-\frac{3}{4}}
d^{\frac{3}{4}} P_T^{\frac{1}{4}}T^{\frac{3}{4}})$ regret, improving the
$\widetilde{O}(k_\mu^{2} c_\mu^{-1}d^{\frac{9}{10}}
P_T^{\frac{1}{5}}T^{\frac{4}{5}})$ bound in prior work, where $k_\mu$ and
$c_\mu$ characterize the reward model's nonlinearity, $P_T$ measures the
non-stationarity, $d$ and $T$ denote the dimension and time horizon.
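To make the weighted strategy concrete, below is a minimal sketch of a discounted (exponentially weighted) ridge-regression estimator of the kind weight-based linear bandit algorithms maintain. The class name, the discount factor gamma, and the regularizer lam are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import numpy as np

class DiscountedRidge:
    """Exponentially weighted (discounted) ridge-regression estimator for a
    drifting linear reward model r_t ~ <theta_t, x_t>.  A minimal sketch of
    the estimator a weight-based linear bandit algorithm maintains; names
    and default parameters are illustrative, not the paper's algorithm."""

    def __init__(self, dim: int, gamma: float = 0.99, lam: float = 1.0):
        self.gamma = gamma                # discount factor in (0, 1]
        self.lam = lam                    # ridge regularizer
        self.V = np.zeros((dim, dim))     # sum_s gamma^{t-s} x_s x_s^T
        self.b = np.zeros(dim)            # sum_s gamma^{t-s} r_s x_s

    def update(self, x: np.ndarray, reward: float) -> None:
        # Down-weight all past statistics by gamma, then add the new point;
        # recent rounds dominate the estimate under gradual drift.
        self.V = self.gamma * self.V + np.outer(x, x)
        self.b = self.gamma * self.b + reward * x

    def estimate(self) -> np.ndarray:
        # Weighted ridge solution: theta_hat = (V + lam * I)^{-1} b.
        dim = self.b.shape[0]
        return np.linalg.solve(self.V + self.lam * np.eye(dim), self.b)
```

A weight-based UCB rule would then pick the arm whose feature $x$ maximizes $\langle\hat{\theta}, x\rangle$ plus an exploration bonus built from $(V + \lambda I)^{-1}$; the abstract's claim is that, under the refined analysis, such a simple weighted estimator is as efficient as window/restart-based designs while retaining the same regret.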
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted WeightedOFUL$^+$ and Restarted SAVE$^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed by Lu et al. (2021), under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical restrictions of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\widetilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$, while G-ESTS attains a comparable $\widetilde{O}(\cdot)$ bound.
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least-squares estimator for the unknown transition dynamics.
It also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary
Dueling Bandits [27.279654173896372]
We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under non-stationary or time-varying preferences.
This is an online learning setup where the agent chooses a pair of items at each round and observes only relative binary win-loss feedback for this pair.
arXiv Detail & Related papers (2021-11-06T16:46:55Z) - Non-stationary Linear Bandits Revisited [26.082923174615495]
Non-stationary linear bandits are a variant of linear bandits with a time-varying underlying regression parameter.
We prove an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret bound for these algorithms, slightly worse than the previously claimed rate.
arXiv Detail & Related papers (2021-03-09T10:07:17Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Non-stationary Reinforcement Learning without Prior Knowledge: An
Optimal Black-box Approach [42.021871809877595]
We present a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a near-stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment.
We show that our approach significantly improves the state of the art for linear bandits, episodic MDPs, and infinite-horizon MDPs.
arXiv Detail & Related papers (2021-02-10T12:43:31Z) - Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the feature dimension, $H$ the planning horizon, $K$ the number of episodes, and $\Delta$ a measure of the non-stationarity.
arXiv Detail & Related papers (2020-10-24T11:02:45Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)