Online Robust Reinforcement Learning with Model Uncertainty
- URL: http://arxiv.org/abs/2109.14523v1
- Date: Wed, 29 Sep 2021 16:17:47 GMT
- Title: Online Robust Reinforcement Learning with Model Uncertainty
- Authors: Yue Wang, Shaofeng Zou
- Abstract summary: We develop a sample-based approach to estimate the unknown uncertainty set and design robust Q-learning and robust TDC algorithms.
For the robust Q-learning algorithm, we prove that it converges to the optimal robust Q function, and for the robust TDC algorithm, we prove that it converges to some stationary points.
Our approach can be readily extended to robustify many other algorithms, e.g., TD, SARSA, and other GTD algorithms.
- Score: 24.892994430374912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust reinforcement learning (RL) aims to find a policy that optimizes the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust RL, where the uncertainty set is defined to be centered at a misspecified MDP that generates a single sample trajectory sequentially
and is assumed to be unknown. We develop a sample-based approach to estimate
the unknown uncertainty set and design a robust Q-learning algorithm (tabular
case) and robust TDC algorithm (function approximation setting), which can be
implemented in an online and incremental fashion. For the robust Q-learning
algorithm, we prove that it converges to the optimal robust Q function, and for
the robust TDC algorithm, we prove that it converges asymptotically to some
stationary points. Unlike the results in [Roy et al., 2017], our algorithms do not need any additional conditions on the discount factor to guarantee convergence. We further characterize the finite-time error bounds of the two algorithms and show that both the robust Q-learning and robust TDC algorithms converge as fast as their vanilla counterparts (within a constant factor). Our
numerical experiments further demonstrate the robustness of our algorithms. Our
approach can be readily extended to robustify many other algorithms, e.g., TD,
SARSA, and other GTD algorithms.
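
For the tabular case, the abstract's description suggests an update of roughly the following shape. This is a hedged sketch assuming an R-contamination uncertainty set (a standard choice in model-free robust RL, where nature may redirect probability mass R arbitrarily); the environment interface and all hyperparameters are illustrative, not the paper's exact construction.

```python
import numpy as np

def robust_q_learning(env, num_states, num_actions, R=0.1, gamma=0.9,
                      alpha=0.1, num_steps=100_000, eps=0.1):
    """Sketch of an online robust Q-learning update for a tabular MDP with an
    R-contamination uncertainty set: nature can move probability mass R from
    the nominal kernel to an arbitrary kernel, so the worst case places that
    mass on the lowest-value state. `env` is assumed to expose Gym-style
    reset() -> s and step(a) -> (s_next, r, done); R, gamma, alpha, and eps
    are illustrative hyperparameters."""
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy behavior policy on the current robust Q estimate
        a = rng.integers(num_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        v = Q.max(axis=1)  # V(s) = max_a Q(s, a)
        # Robust Bellman target: weight (1 - R) on the sampled next state,
        # weight R on the globally worst state value.
        target = r + gamma * ((1.0 - R) * v[s_next] + R * v.min())
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```

The same worst-case correction term can be dropped into other temporal-difference updates, which is how the abstract's claim about robustifying TD, SARSA, and GTD algorithms would play out in practice.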
Related papers
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
arXiv Detail & Related papers (2024-10-31T16:53:20Z)
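
The entry above does not spell out its update rule. As background, the basic online primitive for quantile objectives is the stochastic quantile-regression (pinball-loss) step, shown here in isolation; the target stream and parameters are illustrative, and this is not a reconstruction of the paper's algorithm.

```python
import numpy as np

def quantile_td_update(q, target, tau, lr=0.05):
    """One stochastic quantile-regression (pinball-loss) step: nudges the
    estimate q toward the tau-quantile of the target distribution. This is
    the subgradient step for rho_tau(u) = u * (tau - 1{u < 0})."""
    indicator = 1.0 if target < q else 0.0
    return q - lr * (indicator - tau)  # +lr*tau if target >= q, else -lr*(1-tau)

# Illustration: estimate the 0.1-quantile (a VaR-style statistic) of a noisy
# return stream; the estimate settles near the true quantile.
rng = np.random.default_rng(0)
q = 0.0
for _ in range(50_000):
    sample = rng.normal(loc=1.0, scale=1.0)  # stand-in for a sampled return
    q = quantile_td_update(q, sample, tau=0.1)
print(q)  # close to the true 0.1-quantile of N(1, 1), about -0.28
```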
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
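
The bonus construction for PSRs is more involved, but the idea of a bonus that dominates the total variation error between estimated and true models can be illustrated in the simpler tabular setting, which is an assumption here: a count-based L1 concentration bound (Weissman et al.-style) for an empirical transition model.

```python
import numpy as np

def l1_confidence_bonus(counts, delta=0.05):
    """Count-based bonus for a tabular transition model: with probability at
    least 1 - delta, the L1 distance between the empirical distribution
    p_hat(.|s, a) built from `counts[s, a, s_next]` and the true one is
    bounded by the returned value. Total variation distance is half of
    the L1 distance, so this bonus upper-bounds 2 * TV."""
    d = counts.shape[-1]                     # number of next states
    n = np.maximum(counts.sum(axis=-1), 1)   # visit counts per (s, a)
    return np.sqrt(2.0 * (d * np.log(2.0) + np.log(1.0 / delta)) / n)

# Optimistic planning then adds this bonus (scaled by the value-function
# span) to the estimated rewards before solving the empirical model.
```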
- Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets [10.26382228865201]
We propose policy gradient algorithms for robust infinite-horizon Markov decision processes (MDPs) with non-rectangular uncertainty sets.
The corresponding robust MDPs cannot be solved with dynamic programming techniques and are in fact provably intractable.
We thus present the first complete solution scheme for robust MDPs with non-rectangular uncertainty sets offering global optimality guarantees.
arXiv Detail & Related papers (2023-05-30T13:02:25Z)
- Stochastic Ratios Tracking Algorithm for Large Scale Machine Learning Problems [0.7614628596146599]
We propose a novel algorithm for adaptive step length selection in the classical SGD framework.
Under reasonable conditions, the algorithm produces step lengths in line with well-established theoretical requirements.
We show that the algorithm can generate step lengths comparable to the best step length obtained from manual tuning.
arXiv Detail & Related papers (2023-05-17T06:22:11Z)
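
The summary above does not give the selection rule. As a point of reference only, one well-known automatic step-length scheme for SGD is the Barzilai-Borwein step computed once per epoch (as in SGD-BB of Tan et al., 2016); the sketch below follows that scheme, not the stochastic-ratios algorithm of the paper, and `grad` and all constants are illustrative.

```python
import numpy as np

def sgd_bb(grad, x0, num_epochs=50, steps_per_epoch=100, eta0=0.01):
    """SGD with a Barzilai-Borwein (BB) step length recomputed each epoch.
    `grad(x)` is assumed to return a stochastic gradient at x."""
    x = x0.copy()
    eta = eta0
    x_prev, g_prev = None, None
    for _ in range(num_epochs):
        x_start = x.copy()
        g_avg = np.zeros_like(x)
        for _ in range(steps_per_epoch):
            g = grad(x)
            g_avg += g / steps_per_epoch
            x = x - eta * g                 # inner SGD steps with current eta
        if x_prev is not None:
            s = x_start - x_prev            # iterate difference across epochs
            y = g_avg - g_prev              # averaged-gradient difference
            denom = abs(s @ y)
            if denom > 1e-12:
                # BB1 step, scaled down by the epoch length
                eta = (s @ s) / denom / steps_per_epoch
        x_prev, g_prev = x_start, g_avg
    return x
```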
- On Practical Robust Reinforcement Learning: Practical Uncertainty Set and Double-Agent Algorithm [11.748284119769039]
Robust reinforcement learning (RRL) aims at seeking a robust policy to optimize the worst-case performance over an uncertainty set of Markov decision processes (MDPs).
arXiv Detail & Related papers (2023-05-11T08:52:09Z)
- Fully Stochastic Trust-Region Sequential Quadratic Programming for Equality-Constrained Optimization Problems [62.83783246648714]
We propose a trust-region stochastic sequential quadratic programming algorithm (TR-StoSQP) to solve nonlinear optimization problems with stochastic objectives and deterministic equality constraints.
The algorithm adaptively selects the trust-region radius and, compared to the existing line-search StoSQP schemes, allows us to utilize indefinite Hessian matrices.
arXiv Detail & Related papers (2022-11-29T05:52:17Z)
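
For orientation, the classical deterministic mechanism for choosing the trust-region radius is the ratio test below; the fully stochastic scheme in the paper above replaces this test (it avoids objective evaluations), so treat this only as background, with conventional constants.

```python
def update_trust_region(radius, actual_red, pred_red,
                        eta_low=0.25, eta_high=0.75,
                        shrink=0.5, grow=2.0, max_radius=10.0):
    """Classical trust-region radius update: compare the actual objective
    reduction against the reduction predicted by the local quadratic model,
    then shrink, grow, or keep the radius accordingly."""
    rho = actual_red / pred_red if pred_red > 0 else 0.0
    if rho < eta_low:      # model was a poor predictor: shrink the region
        return shrink * radius
    if rho > eta_high:     # model was accurate: allow a larger step
        return min(grow * radius, max_radius)
    return radius          # otherwise keep the radius unchanged
```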
- Instance-Dependent Confidence and Early Stopping for Reinforcement Learning [99.57168572237421]
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure.
This research provides guarantees that explain ex post the performance differences observed.
A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice.
arXiv Detail & Related papers (2022-01-21T04:25:35Z)
- Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and number of parameters. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9-36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)
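
A per-iteration cost independent of the training set size typically comes from running SGD on a dual reformulation of the DRO objective with minibatches. The sketch below does this for the CVaR form of DRO; the choice of uncertainty set, the `loss_and_grad` and `sample_batch` interfaces, and the hyperparameters are assumptions for illustration, not necessarily the paper's formulation.

```python
import numpy as np

def cvar_dro_sgd(loss_and_grad, sample_batch, theta0, alpha=0.1,
                 lr=0.01, num_steps=10_000, batch_size=32):
    """Minibatch SGD on the CVaR dual form of a DRO objective:
        min_{theta, eta}  eta + (1/alpha) * E[ max(loss(theta; xi) - eta, 0) ].
    Each step touches only `batch_size` samples, so the per-iteration cost is
    independent of the dataset size. `loss_and_grad(theta, xi)` is assumed to
    return (loss_value, grad_wrt_theta) for a single sample xi."""
    theta, eta = theta0.copy(), 0.0
    for _ in range(num_steps):
        batch = sample_batch(batch_size)
        g_theta = np.zeros_like(theta)
        g_eta = 1.0
        for xi in batch:
            loss, grad = loss_and_grad(theta, xi)
            if loss > eta:                 # only tail samples contribute
                g_theta += grad / (alpha * batch_size)
                g_eta -= 1.0 / (alpha * batch_size)
        theta -= lr * g_theta
        eta -= lr * g_eta
    return theta, eta
```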
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm [59.99439951055238]
We show how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based.
The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM).
We show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S&C approach.
arXiv Detail & Related papers (2020-06-23T01:34:18Z)