Non-stationary Reinforcement Learning under General Function
Approximation
- URL: http://arxiv.org/abs/2306.00861v1
- Date: Thu, 1 Jun 2023 16:19:37 GMT
- Title: Non-stationary Reinforcement Learning under General Function
Approximation
- Authors: Songtao Feng, Ming Yin, Ruiquan Huang, Yu-Xiang Wang, Jing Yang,
Yingbin Liang
- Abstract summary: We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
- Score: 60.430936031067006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General function approximation is a powerful tool to handle large state and
action spaces in a broad range of reinforcement learning (RL) scenarios.
However, theoretical understanding of non-stationary MDPs with general function
approximation is still limited. In this paper, we make the first such attempt.
We first propose a new complexity metric called the dynamic Bellman Eluder
(DBE) dimension for non-stationary MDPs, which subsumes the majority of existing
tractable RL problems in both static and non-stationary MDPs. Based on
the proposed complexity metric, we propose a novel confidence-set based
model-free algorithm called SW-OPEA, which features a sliding window mechanism
and a new confidence set design for non-stationary MDPs. We then establish an
upper bound on the dynamic regret for the proposed algorithm, and show that
SW-OPEA is provably efficient as long as the variation budget is not
significantly large. We further demonstrate via examples of non-stationary
linear and tabular MDPs that our algorithm outperforms existing UCB-type
algorithms in the small variation budget regime. To the best of our
knowledge, this is the first dynamic regret analysis in non-stationary MDPs
with general function approximation.
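For orientation, the two quantities in the guarantee can be sketched as follows (standard conventions from the non-stationary RL literature; the paper's exact notation may differ). The dynamic regret over $K$ episodes compares the learner against the per-episode optimal policy,
  $\text{D-Regret}(K) = \sum_{k=1}^{K} \big( V_1^{*,k}(s_1^k) - V_1^{\pi^k,k}(s_1^k) \big),$
while the variation budget accumulates the drift in transitions and rewards across episodes,
  $\Delta = \sum_{k=2}^{K} \sup_{h,s,a} \big( \lVert P_h^{k}(\cdot \mid s,a) - P_h^{k-1}(\cdot \mid s,a) \rVert_1 + \lvert r_h^{k}(s,a) - r_h^{k-1}(s,a) \rvert \big),$
which is the quantity required above to be not significantly large.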
Related papers
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are standard metrics for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
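For context, Value-at-Risk is a quantile of the return distribution (a standard definition, not specific to this paper): for a return $Z$ and level $\alpha \in (0,1)$,
  $\mathrm{VaR}_\alpha(Z) = \inf\{ z \in \mathbb{R} : \Pr(Z \le z) \ge \alpha \},$
so quantile optimization in MDPs seeks a policy that maximizes a chosen quantile of the return rather than its expectation.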
arXiv Detail & Related papers (2024-10-31T16:53:20Z)
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- Efficient Duple Perturbation Robustness in Low-rank MDPs [14.53555781866821]
We introduce duple robustness, i.e., perturbations on both the feature and factor vectors of low-rank Markov decision processes (MDPs).
The novel robust MDP formulation is compatible with the function representation view and is therefore naturally applicable to practical RL problems with large or even continuous state-action spaces.
It also gives rise to a provably efficient and practical algorithm with a theoretical convergence rate guarantee.
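For reference, a low-rank MDP factorizes each transition kernel through unknown feature and factor maps (the standard model; the paper itself specifies the admissible perturbation sets),
  $P_h(s' \mid s, a) = \langle \phi_h(s,a), \mu_h(s') \rangle, \qquad \phi_h(s,a), \mu_h(s') \in \mathbb{R}^d,$
and "duple" robustness refers to perturbing both the feature vectors $\phi_h$ and the factor vectors $\mu_h$.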
arXiv Detail & Related papers (2024-04-11T19:07:15Z)
- Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version, Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
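As a rough sketch of this performance measure (the paper's formal definition may differ in detail), the average dynamic suboptimality gap averages the per-episode gap to the episode-wise optimal value,
  $\overline{\mathrm{Gap}}(K) = \frac{1}{K} \sum_{k=1}^{K} \big( V_1^{*,k}(s_1^k) - V_1^{\pi^k,k}(s_1^k) \big),$
so the guarantee says this average can be driven arbitrarily close to zero with polynomially many samples when the nonstationarity is mild.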
arXiv Detail & Related papers (2023-08-10T09:52:44Z)
- Efficient Model-Free Exploration in Low-Rank MDPs [76.87340323826945]
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
arXiv Detail & Related papers (2023-07-08T15:41:48Z)
- A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning [132.45959478064736]
We propose a general framework that unifies model-based and model-free reinforcement learning.
We propose a novel estimation function with decomposable structural properties for optimization-based exploration.
Under our framework, a new sample-efficient algorithm, OPtimization-based ExploRation with Approximation (OPERA), is proposed.
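One common template for this style of optimization-based exploration (a generic sketch under assumed notation, not necessarily OPERA's exact formulation) picks, in episode $k$, the most optimistic value function whose estimated Bellman errors on past data are small,
  $f^k = \arg\max_{f \in \mathcal{F}} \; f(s_1, \pi_f(s_1)) \quad \text{s.t.} \quad \sum_{i < k} \widehat{\mathcal{E}}(f; \mathcal{D}_i) \le \beta,$
and then executes the greedy policy $\pi_{f^k}$.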
arXiv Detail & Related papers (2022-09-30T17:59:16Z)
- Parallel Stochastic Mirror Descent for MDPs [72.75921150912556]
We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs).
A variant of Mirror Descent is proposed for convex programming problems with Lipschitz-continuous functionals.
We analyze this algorithm in a general case and obtain an estimate of the convergence rate that does not accumulate errors during the operation of the method.
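For reference, the generic Mirror Descent step such methods build on is, in its standard form with step size $\gamma$ and Bregman divergence $V(\cdot,\cdot)$ of a chosen mirror map,
  $x^{t+1} = \arg\min_{x \in X} \big\{ \gamma \langle \nabla f(x^t), x \rangle + V(x^t, x) \big\};$
stochastic and parallel variants replace the gradient with (averaged) stochastic estimates.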
arXiv Detail & Related papers (2021-02-27T19:28:39Z)