Dynamic Regret of Online Markov Decision Processes
- URL: http://arxiv.org/abs/2208.12483v1
- Date: Fri, 26 Aug 2022 07:42:53 GMT
- Title: Dynamic Regret of Online Markov Decision Processes
- Authors: Peng Zhao and Long-Fei Li and Zhi-Hua Zhou
- Abstract summary: We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions.
We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies.
We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs.
- Score: 84.20723936192945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate online Markov Decision Processes (MDPs) with adversarially
changing loss functions and known transitions. We choose dynamic regret as the
performance measure, defined as the performance difference between the learner
and any sequence of feasible changing policies. The measure is strictly
stronger than the standard static regret that benchmarks the learner's
performance against a fixed comparator policy. We consider three foundational models
of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP),
episodic SSP, and infinite-horizon MDPs. For these three models, we propose
novel online ensemble algorithms and establish their dynamic regret guarantees
respectively, in which the results for episodic (loop-free) SSP are provably
minimax optimal in terms of the time horizon and a certain non-stationarity measure.
Furthermore, when the online environments encountered by the learner are
predictable, we design improved algorithms and achieve better dynamic regret
bounds for the episodic (loop-free) SSP; moreover, we demonstrate
impossibility results for the infinite-horizon MDPs.
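For concreteness, the dynamic regret over $K$ episodes can be written against an arbitrary comparator sequence of feasible policies $\pi_1^c, \ldots, \pi_K^c$ (a hedged illustration in generic notation, not necessarily the paper's own):
$$\text{D-Regret}(\pi_1^c,\ldots,\pi_K^c) \;=\; \sum_{k=1}^{K} \mathbb{E}\big[\ell_k(\pi_k)\big] \;-\; \sum_{k=1}^{K} \mathbb{E}\big[\ell_k(\pi_k^c)\big],$$
where $\pi_k$ is the learner's policy and $\ell_k$ the adversarially chosen loss in episode $k$. Choosing all comparators equal to a single fixed policy recovers static regret, which is why dynamic regret is the strictly stronger measure. With known transitions, a policy's expected episode loss is the inner product of its occupancy measure with the loss vector, so dynamic regret can be evaluated once occupancy measures are computed. The sketch below illustrates this for a loop-free (layered) SSP; the function names, array layout, and layered structure are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch (illustrative, not the paper's code): evaluating dynamic regret
# in an episodic loop-free SSP with known transitions via occupancy measures.
import numpy as np

def occupancy_measure(P, pi):
    """q[l, x, a] = probability of visiting state x in layer l and playing a under pi.

    P[l, x, a, y] : known transition probabilities from layer l to layer l+1 (assumed layout)
    pi[l, x, a]   : probability that the policy plays action a in state x of layer l
    """
    num_layers, num_states, num_actions = pi.shape
    q = np.zeros_like(pi)
    mu = np.zeros(num_states)
    mu[0] = 1.0                                      # each episode starts in state 0 of layer 0
    for l in range(num_layers):
        q[l] = mu[:, None] * pi[l]                   # split state-visitation mass across actions
        if l < num_layers - 1:
            mu = np.einsum("xa,xay->y", q[l], P[l])  # push mass forward to the next layer
    return q

def dynamic_regret(P, learner_pis, comparator_pis, losses):
    """Sum over episodes of <q^{learner_k} - q^{comparator_k}, loss_k>."""
    total = 0.0
    for pi_lrn, pi_cmp, ell in zip(learner_pis, comparator_pis, losses):
        total += np.sum((occupancy_measure(P, pi_lrn) - occupancy_measure(P, pi_cmp)) * ell)
    return total
```

Because the comparator sequence is arbitrary, a meaningful guarantee must scale with how much that sequence changes over time, which is the role of the non-stationarity measure appearing in the paper's bounds.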
Related papers
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach, Two-Stage General Decision Rules (TS-GDR), to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks, named Two-Stage Deep Decision Rules (TS-DDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Online Reinforcement Learning in Markov Decision Process Using Linear
Programming [1.0878040851638]
We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and rewards drawn from some fixed but unknown distribution.
We devise a simple and efficient model-based algorithm that achieves $\widetilde{O}(LX\sqrt{TA})$ regret with high probability.
arXiv Detail & Related papers (2023-03-31T22:21:41Z) - Efficient Online Learning with Memory via Frank-Wolfe Optimization:
Algorithms with Bounded Dynamic Regret and Applications to Control [15.588080817106563]
We introduce the first projection-free meta-base learning algorithm with memory that minimizes dynamic regret.
We are motivated by artificial intelligence applications where autonomous agents need to adapt to time-varying environments.
arXiv Detail & Related papers (2023-01-02T01:12:29Z) - Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization [70.4342220499858]
We introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in dynamic regret with problem-dependent quantities.
Our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and safeguard the same rate in the worst case.
arXiv Detail & Related papers (2021-12-29T02:42:59Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with
Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the limit-deterministic generalized Büchi automaton (LDGBA) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - Non-stationary Online Learning with Memory and Non-stochastic Control [71.14503310914799]
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions.
In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments.
We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret in terms of time horizon, non-stationarity measure, and memory length.
arXiv Detail & Related papers (2021-02-07T09:45:15Z) - Dynamic Regret of Convex and Smooth Functions [93.71361250701075]
We investigate online convex optimization in non-stationary environments.
We choose the dynamic regret as the performance measure.
We show that it is possible to further enhance the dynamic regret by exploiting the smoothness condition.
arXiv Detail & Related papers (2020-07-07T14:10:57Z)