Nonstationary Reinforcement Learning with Linear Function Approximation
- URL: http://arxiv.org/abs/2010.04244v3
- Date: Sat, 13 Apr 2024 06:52:10 GMT
- Title: Nonstationary Reinforcement Learning with Linear Function Approximation
- Authors: Huozhi Zhou, Jinglin Chen, Lav R. Varshney, Ashish Jagmohan,
- Abstract summary: We consider reinforcement learning in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment.
We first develop an optimistic modification of least-squares value with periodic restart, and bound its dynamic regret when variation budgets are known.
We derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and as a byproduct establish a minimax regret lower bound for linear MDPs unsolved by Jin et al.
- Score: 19.521419943509784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment. Specifically, both the reward and state transition functions can evolve over time but their total variations do not exceed a $\textit{variation budget}$. We first develop $\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of least-squares value iteration with periodic restart, and bound its dynamic regret when variation budgets are known. Then we propose a parameter-free algorithm $\texttt{Ada-LSVI-UCB-Restart}$ that extends to unknown variation budgets. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and as a byproduct establish a minimax regret lower bound for linear MDPs unsolved by Jin et al. (2020). Finally, we provide numerical experiments to demonstrate the effectiveness of our proposed algorithms.
Related papers
- Settling Constant Regrets in Linear Markov Decision Processes [57.34287648914407]
We study the constant regret guarantees in reinforcement learning (RL)
We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs)
For an MDP characterized by a minimal suboptimality gap $Delta$, Cert-LSVI-UCB has a cumulative regret of $tildemathcalO(d3H5/Delta)$ with high probability, provided that the misspecification level $zeta$ is below $tildemathcalO(Delta / (sqrtd
arXiv Detail & Related papers (2024-04-16T17:23:19Z) - Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $widetildeO(Kfrac56Afrac12dln (1+M)/ (1-gamma)2)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - Non-stationary Reinforcement Learning under General Function
Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
arXiv Detail & Related papers (2023-06-01T16:19:37Z) - Provably Efficient Model-Free Algorithms for Non-stationary CMDPs [10.930095238723327]
We study model-free reinforcement learning algorithms in episodic non-stationary constrained Markov Decision Processes.
In the non-stationary environment, reward, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets.
We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs.
arXiv Detail & Related papers (2023-03-10T06:33:38Z) - Logarithmic Switching Cost in Reinforcement Learning beyond Linear MDPs [31.673857053336352]
We propose an algorithm that achieves the near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time-horizon $H$.
We also show the doubling trick'' used in ELEANOR-LowSwitching can be further leveraged for the generalized linear function approximation.
arXiv Detail & Related papers (2023-02-24T05:14:27Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $tildeO(d sqrtsum_k = 1K sigma_k2 + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $tilde O(dsqrtH3K)$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs)
We propose OPT-WLSVI an optimistic model-free algorithm based on weighted least squares value which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $widetildemathcalO(d5/4H2 Delta1/4 K3/4)$ where $d$
arXiv Detail & Related papers (2020-10-24T11:02:45Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs)
We propose a new emphupper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Frequentist Regret Bounds for Randomized Least-Squares Value Iteration [94.47472987987805]
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL)
In this paper, we introduce an optimistically-perturbed variant of the popular randomized least-squares value (RLSVI)
Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $widetilde O(d2 H2 sqrtT)$ where $ d $ are the feature dimension, $ H $ is the horizon, and $ T $ is the total number of
arXiv Detail & Related papers (2019-11-01T19:48:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.