Reinforcement Learning for Non-Stationary Markov Decision Processes: The
Blessing of (More) Optimism
- URL: http://arxiv.org/abs/2006.14389v1
- Date: Wed, 24 Jun 2020 15:40:21 GMT
- Title: Reinforcement Learning for Non-Stationary Markov Decision Processes: The
Blessing of (More) Optimism
- Authors: Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu
- Abstract summary: We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity.
We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm.
We propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound.
- Score: 25.20231604057821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider un-discounted reinforcement learning (RL) in Markov decision
processes (MDPs) under drifting non-stationarity, i.e., both the reward and
state transition distributions are allowed to evolve over time, as long as
their respective total variations, quantified by suitable metrics, do not
exceed certain variation budgets. We first develop the Sliding Window
Upper-Confidence bound for Reinforcement Learning with Confidence Widening
(SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the
variation budgets are known. In addition, we propose the
Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the
SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a
parameter-free manner, i.e., without knowing the variation budgets. Notably,
learning non-stationary MDPs via the conventional optimistic exploration
technique presents a unique challenge absent in existing (non-stationary)
bandit learning settings. We overcome the challenge by a novel confidence
widening technique that incorporates additional optimism.
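To make the two ingredients named in the abstract concrete, here is a minimal sketch (an illustration under assumed parameter choices, not the authors' implementation): transition frequencies are estimated from only the last $W$ observations, so stale data from a drifted environment is discarded, and an extra widening term $\eta$ is added on top of the usual confidence radius, supplying the "additional optimism" of confidence widening. The class name, the Hoeffding-style radius, and the parameters `window` and `eta` are illustrative; SWUCRL2-CW sets these quantities from the variation budgets, and BORL tunes them without knowing the budgets.

```python
from collections import deque, defaultdict
import math


class SlidingWindowModel:
    """Sliding-window transition estimates with a widened confidence radius.

    Minimal sketch, not the authors' implementation: the window length W,
    the widening parameter eta, and the Hoeffding-style radius are
    illustrative choices; SWUCRL2-CW derives them from the variation budgets.
    """

    def __init__(self, window, eta, delta=0.05):
        self.window = window                    # W: keep only the last W transitions
        self.eta = eta                          # extra optimism (confidence widening)
        self.delta = delta                      # confidence level
        self.buffer = deque(maxlen=window)      # recent (s, a, s') triples

    def update(self, s, a, s_next):
        self.buffer.append((s, a, s_next))

    def estimate(self, s, a):
        """Empirical next-state distribution and widened radius for (s, a)."""
        counts, n = defaultdict(int), 0
        for si, ai, sn in self.buffer:
            if (si, ai) == (s, a):
                counts[sn] += 1
                n += 1
        if n == 0:
            return {}, float("inf")             # unvisited pair: maximal optimism
        p_hat = {sn: c / n for sn, c in counts.items()}
        radius = math.sqrt(2.0 * math.log(1.0 / self.delta) / n)
        return p_hat, radius + self.eta         # widening adds eta on top
```

The shrinking window guards against drift; the extra $\eta$ is the part specific to the MDP setting, addressing the challenge the abstract notes is absent in non-stationary bandit problems.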
Related papers
- Robust Offline Reinforcement Learning for Non-Markovian Decision Processes [48.9399496805422]
We study the learning problem of robust offline non-Markovian RL.
We introduce a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of uncertainty sets.
By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy.
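As a rough illustration of the lower confidence bound (LCB) principle mentioned above, a pessimistic estimate subtracts a data-dependent penalty from the empirical value so that poorly covered state-action pairs are not over-valued. This is a generic sketch, not the paper's construction (which builds robust, worst-case-over-uncertainty-set values on a distilled dataset); the penalty form and constants are assumptions.

```python
import math


def lcb_value(mean_estimate, count, scale=1.0, delta=0.05):
    """Generic lower confidence bound used in pessimistic offline RL.

    Illustrative only: subtracts a penalty that shrinks as the number of
    samples backing the estimate grows.
    """
    if count == 0:
        return float("-inf")          # no data at all: maximally pessimistic
    penalty = scale * math.sqrt(math.log(1.0 / delta) / count)
    return mean_estimate - penalty
```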
arXiv Detail & Related papers (2024-11-12T03:22:56Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
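For context, an uncertainty Bellman equation propagates a local uncertainty term through the same recursion that propagates values. One generic finite-horizon form, given here only to illustrate the idea (the paper derives a refined equation whose fixed point equals the posterior variance of values), is

$u_h(s,a) = \nu_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot\mid s,a),\, a' \sim \pi(\cdot\mid s')}\big[u_{h+1}(s',a')\big],$

where $\nu_h(s,a)$ denotes the local (epistemic) uncertainty at $(s,a)$ arising from model error.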
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z)
- Non-stationary Risk-sensitive Reinforcement Learning: Near-optimal Dynamic Regret, Adaptive Detection, and Separation Design [9.554944575754638]
We study risk-sensitive reinforcement learning (RL) based on an entropic risk measure in episodic non-stationary Markov decision processes (MDPs).
We propose two restart-based algorithms, namely Restart-RSMB and Restart-RSQ, and establish their dynamic regrets.
This work offers the first non-asymptotic theoretical analyses for the non-stationary risk-sensitive RL in the literature.
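The restart idea can be sketched generically: run any base (stationary) RL routine in blocks and discard its statistics at the start of each block, so estimates never mix data from very different environments. In the sketch below, `run_stationary_rl` and `restart_period` are hypothetical stand-ins, not the paper's components.

```python
def run_with_restarts(total_episodes, restart_period, run_stationary_rl):
    """Run a base RL routine in blocks, discarding its statistics between blocks.

    Generic sketch of the restart principle behind restart-based algorithms;
    `run_stationary_rl(num_episodes)` is a hypothetical callable that trains
    from scratch for `num_episodes` episodes and returns a policy.
    """
    policies = []
    episodes_done = 0
    while episodes_done < total_episodes:
        block = min(restart_period, total_episodes - episodes_done)
        policies.append(run_stationary_rl(block))   # fresh statistics each block
        episodes_done += block
    return policies
```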
arXiv Detail & Related papers (2022-11-19T22:40:09Z)
- Opportunistic Episodic Reinforcement Learning [9.364712393700056]
Opportunistic reinforcement learning is a new variant of reinforcement learning in which the regret of selecting a suboptimal action varies with an external environmental condition known as the variation factor.
Our intuition is to exploit more when the variation factor is high, and explore more when the variation factor is low.
Our algorithms balance the exploration-exploitation trade-off for reinforcement learning by introducing variation factor-dependent optimism to guide exploration.
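A minimal sketch of variation factor-dependent optimism (illustrative scaling, not the paper's exact bonus): the exploration bonus grows when the variation factor is low, so the agent explores, and shrinks when it is high, so the agent exploits.

```python
import math


def optimistic_q(q_hat, visit_count, variation_factor, c=1.0):
    """Optimistic Q-value whose exploration bonus depends on the variation factor.

    Illustrative scaling: with a low variation factor suboptimal actions are
    cheap, so the bonus is large and the agent explores; with a high variation
    factor the bonus shrinks and the agent exploits.
    """
    if visit_count == 0:
        return float("inf")
    bonus = c * math.sqrt(1.0 / visit_count) / (1.0 + variation_factor)
    return q_hat + bonus
```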
arXiv Detail & Related papers (2022-10-24T18:02:33Z)
- Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty [5.117030416610515]
Online reinforcement learning (RL) has been widely applied in information processing scenarios.
We consider un-discounted RL in general Markov decision processes (MDPs) with both endogeneous and exogeneous uncertainty.
We establish a regret bound that saves at most $\sqrt{S}$ or $S^{\frac{1}{6}}T^{\frac{1}{12}}$ compared with the latest results in the literature.
arXiv Detail & Related papers (2021-06-15T22:57:45Z)
- Low-Precision Reinforcement Learning [63.930246183244705]
Low-precision training has become a popular approach to reduce computation time, memory footprint, and energy consumption in supervised learning.
In this paper we consider continuous control with the state-of-the-art SAC agent and demonstrate that a naïve adaptation of low-precision methods from supervised learning fails.
arXiv Detail & Related papers (2021-02-26T16:16:28Z)
- Nonstationary Reinforcement Learning with Linear Function Approximation [19.521419943509784]
We consider reinforcement learning in episodic Markov decision processes (MDPs) with linear function approximation under a drifting environment.
We first develop an optimistic modification of least-squares value iteration with periodic restart, and bound its dynamic regret when the variation budgets are known.
We derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and, as a byproduct, establish a minimax regret lower bound for linear MDPs that was left open by Jin et al.
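A bare-bones sketch of the periodic-restart ingredient for linear function approximation: the ridge-regression statistics used by least-squares value iteration are simply re-initialized on a fixed schedule, so value estimates are always fit to recent data. The feature dimension, regularization, and restart period below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np


class RestartedLeastSquares:
    """Ridge-regression statistics for linear value estimation, reset periodically.

    Illustrative sketch: maintains A = lam * I + sum(phi phi^T) and
    b = sum(target * phi), and discards both every `restart_period` updates.
    """

    def __init__(self, dim, restart_period, lam=1.0):
        self.dim, self.restart_period, self.lam = dim, restart_period, lam
        self.updates = 0
        self._reset()

    def _reset(self):
        self.A = self.lam * np.eye(self.dim)
        self.b = np.zeros(self.dim)

    def update(self, phi, target):
        if self.updates > 0 and self.updates % self.restart_period == 0:
            self._reset()                       # periodic restart: forget stale data
        self.A += np.outer(phi, phi)
        self.b += target * phi
        self.updates += 1

    def weights(self):
        return np.linalg.solve(self.A, self.b)  # ridge solution w = A^{-1} b
```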
arXiv Detail & Related papers (2020-10-08T20:07:44Z)
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant imitation learning from observation (ILO) algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
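The primal-dual pattern can be sketched generically: keep a Lagrange multiplier for the constraint, plan (optimistically) against the reward penalized by the multiplier-weighted cost, and update the multiplier by projected gradient ascent on the observed constraint violation. The step size, cap, and per-episode interface below are illustrative assumptions, not the paper's algorithm.

```python
def dual_update(lmbda, episode_cost, budget, step_size=0.1, cap=10.0):
    """Projected gradient-ascent step on the Lagrange multiplier of a CMDP constraint.

    Illustrative: lmbda grows when the episode's cumulative cost exceeds the
    budget and shrinks otherwise, clipped to [0, cap].
    """
    lmbda = lmbda + step_size * (episode_cost - budget)
    return min(max(lmbda, 0.0), cap)


def penalized_reward(reward, cost, lmbda):
    """Primal objective an optimistic planner would maximize each episode."""
    return reward - lmbda * cost
```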
arXiv Detail & Related papers (2020-03-02T05:02:23Z)