Dynamic Regret of Policy Optimization in Non-stationary Environments
- URL: http://arxiv.org/abs/2007.00148v1
- Date: Tue, 30 Jun 2020 23:34:37 GMT
- Title: Dynamic Regret of Policy Optimization in Non-stationary Environments
- Authors: Yingjie Fei, Zhuoran Yang, Zhaoran Wang, Qiaomin Xie
- Abstract summary: We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret.
We show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction.
To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
- Score: 120.01408308460095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider reinforcement learning (RL) in episodic MDPs with adversarial
full-information reward feedback and unknown fixed transition kernels. We
propose two model-free policy optimization algorithms, POWER and POWER++, and
establish guarantees for their dynamic regret. Compared with the classical
notion of static regret, dynamic regret is a stronger notion as it explicitly
accounts for the non-stationarity of environments. The dynamic regret attained
by the proposed algorithms interpolates between different regimes of
non-stationarity, and moreover satisfies a notion of adaptive
(near-)optimality, in the sense that it matches the (near-)optimal static
regret under slow-changing environments. The dynamic regret bound features two
components, one arising from exploration, which deals with the uncertainty of
transition kernels, and the other arising from adaptation, which deals with
non-stationary environments. Specifically, we show that POWER++ improves over
POWER on the second component of the dynamic regret by actively adapting to
non-stationarity through prediction. To the best of our knowledge, our work is
the first dynamic regret analysis of model-free RL algorithms in non-stationary
environments.
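For concreteness, dynamic regret in this episodic setting is usually defined against a time-varying comparator: writing $\pi_k$ for the policy played in episode $k$ and $\pi_k^{*}$ for the optimal policy under that episode's rewards, the dynamic regret over $K$ episodes is $\sum_{k=1}^{K} \big( V_1^{\pi_k^{*}}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \big)$, whereas static regret fixes a single comparator $\pi^{*}$ for all episodes. (This is the standard formulation from the non-stationary online learning literature; the paper's exact notation may differ.)
The adaptation-through-prediction idea behind POWER++ is in the spirit of optimistic online learning, where each update also incorporates a guess of the upcoming reward. Below is a minimal, hypothetical sketch of that generic mechanism over a finite action set, using optimistic exponentiated gradient; it is not the paper's POWER++ algorithm, and the step size eta and the source of the predictions are placeholders.

import numpy as np

def optimistic_exp_weights(rewards, predictions, eta=0.1):
    # rewards[t]     : reward vector observed at round t, shape (A,)
    # predictions[t] : prediction of rewards[t], available before round t is played
    # Returns the sequence of action distributions played over the rounds.
    num_rounds, num_actions = rewards.shape
    base = np.ones(num_actions) / num_actions   # secondary ("lazy") sequence
    played = []
    for reward, prediction in zip(rewards, predictions):
        # Play a distribution tilted toward the predicted upcoming reward.
        weights = base * np.exp(eta * prediction)
        played.append(weights / weights.sum())
        # Update the secondary sequence with the reward actually observed.
        base = base * np.exp(eta * reward)
        base /= base.sum()
    return np.array(played)

When the predictions track the true rewards closely, the played distribution reacts to a change roughly one round earlier than a plain exponentiated-gradient update would, which is the sense in which prediction helps the adaptation component.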
Related papers
- Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM [17.661231232206028]
Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention.
We propose a novel SLAM framework for dynamic environments.
arXiv Detail & Related papers (2024-07-18T09:35:48Z)
- Vector Autoregressive Evolution for Dynamic Multi-Objective Optimisation [7.5104598146227]
Dynamic multi-objective optimisation (DMO) handles optimisation problems with multiple objectives in varying environments.
This paper proposes vector autoregressive evolution (VARE), consisting of vector autoregression (VAR) and environment-aware hypermutation, to address environmental changes in DMO.
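As a rough illustration of the vector-autoregression component (a generic VAR(1) model fit by least squares; this is not the VARE implementation, and the function names are placeholders), past best solutions can be used to forecast where the optimum moves after an environmental change:

import numpy as np

def fit_var1(history):
    # history: array of shape (T, d) holding the best solution found in each
    # of the last T environments. Fit x_{t+1} ~ A x_t + b by least squares.
    X, Y = history[:-1], history[1:]
    X1 = np.hstack([X, np.ones((len(X), 1))])        # append intercept column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1].T, coef[-1]                      # A, b

def predict_next_solution(history):
    # Forecast the next best solution; such a forecast could be used to seed
    # the population when a change is detected.
    A, b = fit_var1(history)
    return A @ history[-1] + b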
arXiv Detail & Related papers (2023-05-22T06:24:25Z)
- Universal Online Optimization in Dynamic Environments via Uniclass Prediction [0.0]
We propose a novel and intuitive framework for universal online optimization in dynamic environments.
Our strategy does not rely on the construction of a set of experts and an accompanying meta-algorithm.
This is the first paper proposing a universal approach with state-of-the-art dynamic regret guarantees even for general convex cost functions.
arXiv Detail & Related papers (2023-02-13T03:00:45Z)
- ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits [20.128001589147512]
We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem.
We show a near-optimal $\tilde{O}(\sqrt{S^{\texttt{CW}} T})$ dynamic regret bound, where $S^{\texttt{CW}}$ is the number of times the Condorcet winner changes in $T$ rounds.
arXiv Detail & Related papers (2022-10-25T20:26:02Z)
- Dynamic Regret of Adaptive Gradient Methods for Strongly Convex Problems [0.0]
We analyze a variant of ADAGRAD (referred to as M-ADAGRAD) in a strongly convex setting via the notion of dynamic regret.
We demonstrate a regret bound in terms of the path-length of the minimizer sequence that essentially reflects the non-stationarity of environments.
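For reference, the path-length of the minimizer sequence is typically defined as $P_T = \sum_{t=2}^{T} \lVert \theta_t^{*} - \theta_{t-1}^{*} \rVert$, where $\theta_t^{*}$ minimizes the loss at round $t$; a small $P_T$ corresponds to a slowly drifting environment. (Standard notation from the dynamic-regret literature; the paper's definition may differ in minor details.)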
arXiv Detail & Related papers (2022-09-04T12:40:57Z)
- Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization [63.75188254377202]
Deep reinforcement learning algorithms can perform poorly in real-world tasks due to discrepancy between source and target environments.
We propose State-Conservative Policy Optimization (SCPO), a novel model-free actor-critic algorithm that learns robust policies without modeling the disturbance in advance.
Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
arXiv Detail & Related papers (2021-12-20T13:13:05Z)
- Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z)
- Value Iteration in Continuous Actions, States and Time [99.00362538261972]
We propose a continuous fitted value iteration (cFVI) algorithm for continuous states and actions.
The optimal policy can be derived for non-linear control-affine dynamics.
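A standard fact behind results of this kind (stated under the usual assumptions, which may differ in detail from the paper): for control-affine dynamics $\dot{x} = f(x) + g(x)u$ and a reward with a strictly convex action penalty, the Hamilton-Jacobi-Bellman condition gives the optimal action in closed form. For instance, with a quadratic action cost $u^{\top} R u$, the maximizing action is $u^{*}(x) = \tfrac{1}{2} R^{-1} g(x)^{\top} \nabla_x V(x)$, so the policy follows directly from the learned value function.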
Videos of the physical system are available at https://sites.google.com/view/value-iteration.
arXiv Detail & Related papers (2021-05-10T21:40:56Z)
- Non-stationary Online Learning with Memory and Non-stochastic Control [71.14503310914799]
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions.
In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments.
We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret in terms of time horizon, non-stationarity measure, and memory length.
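One common formalization (hedged, as the notation may differ from the paper's): with memory length $m$, the dynamic policy regret compares $\sum_{t=1}^{T} f_t(x_{t-m}, \dots, x_t)$ against $\sum_{t=1}^{T} f_t(u_t, \dots, u_t)$ for an arbitrary comparator sequence $u_1, \dots, u_T$, and the resulting bounds typically scale with the path-length of that sequence.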
arXiv Detail & Related papers (2021-02-07T09:45:15Z)
- Minimizing Dynamic Regret and Adaptive Regret Simultaneously [60.17824125301273]
We propose novel online algorithms that are able to minimize the dynamic regret and adaptive regret simultaneously.
Our theoretical guarantee is even stronger in the sense that one algorithm is able to minimize the dynamic regret over any interval.
arXiv Detail & Related papers (2020-02-06T03:32:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.