Acting in Delayed Environments with Non-Stationary Markov Policies
- URL: http://arxiv.org/abs/2101.11992v4
- Date: Wed, 13 Dec 2023 02:40:47 GMT
- Title: Acting in Delayed Environments with Non-Stationary Markov Policies
- Authors: Esther Derman, Gal Dalal, Shie Mannor
- Abstract summary: We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps.
We prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary.
We devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation.
- Score: 57.52103323209643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard Markov Decision Process (MDP) formulation hinges on the
assumption that an action is executed immediately after it was chosen. However,
this assumption is often unrealistic and can lead to catastrophic failures in
applications such as robotic manipulation, cloud computing, and finance. We
introduce a framework for learning and planning in MDPs where the
decision-maker commits actions that are executed with a delay of $m$ steps. The
brute-force state augmentation baseline where the state is concatenated to the
last $m$ committed actions suffers from an exponential complexity in $m$, as we
show for policy iteration. We then prove that with execution delay,
deterministic Markov policies in the original state-space are sufficient for
attaining maximal reward, but need to be non-stationary. As for stationary
Markov policies, we show they are sub-optimal in general. Consequently, we
devise a non-stationary Q-learning style model-based algorithm that solves
delayed execution tasks without resorting to state-augmentation. Experiments on
tabular, physical, and Atari domains reveal that it converges quickly to high
performance even for substantial delays, while standard approaches that either
ignore the delay or rely on state-augmentation struggle or fail due to
divergence. The code is available at github.com/galdl/rl_delay_basic and
github.com/galdl/rl_delay_atari.
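To make the delayed-execution setting concrete, below is a minimal illustrative sketch (not the authors' code from the repositories above): a wrapper that queues each committed action for $m$ steps before it reaches the underlying environment, assuming a generic env object exposing reset() and step(action). The names DelayedEnv, default_action, and augmented_state are hypothetical. The augmented_state helper mirrors the brute-force baseline from the abstract, whose effective state space grows as $|S| \cdot |A|^m$ and hence becomes exponentially expensive in $m$ for policy iteration.
```python
# Minimal illustrative sketch of execution delay (not the paper's implementation).
# Assumes `env` is any object exposing reset() -> state and step(action) -> result.
from collections import deque

class DelayedEnv:
    """Queues each committed action for m steps before applying it to the environment."""

    def __init__(self, env, m, default_action):
        self.env = env
        self.m = m                            # execution delay in steps
        self.default_action = default_action  # used while no committed action has matured yet
        self.pending = deque()

    def reset(self):
        state = self.env.reset()
        # For the first m steps, no committed action is due yet, so pad with defaults.
        self.pending = deque([self.default_action] * self.m)
        return state

    def step(self, action):
        # The action committed now is executed only m steps from now; the action
        # applied to the underlying environment is the oldest pending one.
        self.pending.append(action)
        executed = self.pending.popleft()
        return self.env.step(executed)

    def augmented_state(self, state):
        # Brute-force baseline from the abstract: the state concatenated with the
        # m pending actions.  The augmented space has |S| * |A|^m elements, which
        # is the source of the exponential complexity in m for policy iteration.
        return (state, tuple(self.pending))
```
The paper's result, by contrast, is that a deterministic but non-stationary Markov policy over the original (non-augmented) state space already attains maximal reward, which is what its non-stationary Q-learning style algorithm exploits.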
Related papers
- Model Predictive Control is Almost Optimal for Restless Bandit [2.295863158976069]
We consider the discrete time infinite horizon average reward restless Markovian bandit (RMAB) problem.
We propose a model predictive control based non-stationary policy with a rolling computational horizon $\tau$.
We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition.
arXiv Detail & Related papers (2024-10-08T19:34:23Z)
- Tree Search-Based Policy Optimization under Stochastic Execution Delay [46.849634120584646]
Delayed execution MDPs are a new formalism addressing random delays without resorting to state augmentation.
We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies.
We devise DEZ, a model-based algorithm that optimizes over the class of Markov policies.
arXiv Detail & Related papers (2024-04-08T12:19:04Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z)
- Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms.
arXiv Detail & Related papers (2023-06-05T03:57:16Z)
- Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z)
- Minimax Regret for Stochastic Shortest Path [63.45407095296692]
We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost.
We show that the minimax regret for this setting is $\widetilde{O}(B_\star \sqrt{|S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state.
Our algorithm runs in polynomial time per episode, and is based on a novel reduction to reinforcement learning in finite-horizon MDPs.
arXiv Detail & Related papers (2021-03-24T10:11:49Z)
- Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
- Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints [38.2783003051101]
This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1.
We propose a model-free algorithm that converts the PCMDP problem to an unconstrained problem, to which a Q-learning based approach is applied.
arXiv Detail & Related papers (2020-03-11T23:23:29Z)