A Provably-Efficient Model-Free Algorithm for Constrained Markov
Decision Processes
- URL: http://arxiv.org/abs/2106.01577v1
- Date: Thu, 3 Jun 2021 03:53:27 GMT
- Title: A Provably-Efficient Model-Free Algorithm for Constrained Markov
Decision Processes
- Authors: Honghao Wei, Xin Liu, Lei Ying
- Abstract summary: This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation.
The algorithm is named Triple-Q because it has three key components: a Q-function for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation.
- Score: 13.877420496703627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the first {\em model-free}, {\em simulator-free}
reinforcement learning algorithm for Constrained Markov Decision Processes
(CMDPs) with sublinear regret and zero constraint violation. The algorithm is
named Triple-Q because it has three key components: a Q-function (also called
action-value function) for the cumulative reward, a Q-function for the
cumulative utility for the constraint, and a virtual-Queue that
(over)-estimates the cumulative constraint violation. Under Triple-Q, at each
step, an action is chosen based on the pseudo-Q-value that is a combination of
the three Q values. The algorithm updates the reward and utility Q-values with
learning rates that depend on the visit counts to the corresponding (state,
action) pairs and are periodically reset. In the episodic CMDP setting,
Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4
S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}} \right)$ regret, where $K$ is the
total number of episodes, $H$ is the number of steps in each episode, $S$ is
the number of states, $A$ is the number of actions, and $\delta$ is Slater's
constant. Furthermore, Triple-Q guarantees zero constraint violation when $K$
is sufficiently large. Finally, the computational complexity of Triple-Q is
similar to that of SARSA for unconstrained MDPs, making the algorithm computationally efficient.
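The description above is concrete enough to sketch the control loop in code. The Python sketch below is a minimal illustration, assuming an episodic tabular CMDP with one reward and one utility signal per step: it keeps a reward Q-table, a utility Q-table, and a scalar virtual queue, chooses actions by the pseudo-Q-value (reward Q plus queue-weighted utility Q), and uses visit-count-dependent learning rates that are periodically reset. The class name, the `eta` weight, the queue update, and the reset period are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

class TripleQSketch:
    """Illustrative sketch of the Triple-Q components (not the paper's exact tuning)."""

    def __init__(self, S, A, H, constraint_threshold, eta=1.0, reset_period=100):
        self.H = H
        self.rho = constraint_threshold       # required expected utility per episode
        self.eta = eta                        # trade-off weight (assumed hyperparameter)
        self.reset_period = reset_period      # episodes between learning-rate resets
        # Optimistically initialized Q-tables for reward and utility, one per step h.
        self.Qr = np.full((H, S, A), float(H))
        self.Qc = np.full((H, S, A), float(H))
        self.counts = np.zeros((H, S, A), dtype=int)
        self.Z = 0.0                          # virtual queue (over)estimating violation

    def act(self, h, s):
        # Pseudo-Q-value: reward Q plus queue-weighted utility Q.
        pseudo_q = self.Qr[h, s] + (self.Z / self.eta) * self.Qc[h, s]
        return int(np.argmax(pseudo_q))

    def update(self, h, s, a, r, c, s_next):
        # Learning rate depends on the visit count of (h, s, a).
        self.counts[h, s, a] += 1
        t = self.counts[h, s, a]
        alpha = (self.H + 1) / (self.H + t)
        vr = self.Qr[h + 1, s_next].max() if h + 1 < self.H else 0.0
        vc = self.Qc[h + 1, s_next].max() if h + 1 < self.H else 0.0
        self.Qr[h, s, a] = (1 - alpha) * self.Qr[h, s, a] + alpha * (r + vr)
        self.Qc[h, s, a] = (1 - alpha) * self.Qc[h, s, a] + alpha * (c + vc)

    def end_of_episode(self, episode, utility_estimate):
        # The virtual queue grows when estimated utility falls short of the threshold.
        self.Z = max(self.Z + self.rho - utility_estimate, 0.0)
        if (episode + 1) % self.reset_period == 0:
            self.counts[:] = 0                # reset visit counts => reset learning rates
```

In an episode loop one would call `act` at each step, `update` after each transition, and `end_of_episode` with an estimate of the episode's cumulative utility once the episode finishes.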
Related papers
- Computational Supremacy of Quantum Eigensolver by Extension of Optimized Binary Configurations [0.0]
We develop a quantum eigensolver based on a D-Wave Quantum Annealer (D-Wave QA).
This approach performs iterative QA measurements to optimize the eigenstates $\vert \psi \rangle$ without the derivation of a classical computer.
We confirm that the proposed QE algorithm provides exact solutions within errors of $5 \times 10^{-3}$.
arXiv Detail & Related papers (2024-06-05T15:19:53Z) - Average-Case Complexity of Tensor Decomposition for Low-Degree
Polynomials [93.59919600451487]
"Statistical-computational gaps" occur in many statistical inference tasks.
We consider a model for random order-3 tensor decomposition where one component is slightly larger in norm than the rest.
We show that low-degree polynomials of the tensor entries can accurately estimate the largest component when $r \ll n^{3/2}$ but fail to do so when $r \gg n^{3/2}$.
arXiv Detail & Related papers (2022-11-10T00:40:37Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Polyak-Ruppert Averaged Q-Leaning is Statistically Efficient [90.14768299744792]
We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a $\gamma$-discounted MDP.
We establish the asymptotic normality of the iteration-averaged $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows averaged Q-learning is statistically efficient.
arXiv Detail & Related papers (2021-12-29T14:47:56Z) - Tightening the Dependence on Horizon in the Sample Complexity of
Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z) - Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control [28.80743320843154]
We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL.
We show that our algorithms are nearly optimal by establishing an information-theoretic lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
arXiv Detail & Related papers (2020-10-07T04:55:56Z) - Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum (the tabular update being analyzed is sketched after this list).
arXiv Detail & Related papers (2020-09-29T18:48:21Z) - A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with
Constraints [8.849815837266977]
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems in which the agent must respect constraints in addition to optimizing its objective.
We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning.
arXiv Detail & Related papers (2020-09-23T19:30:46Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints [38.2783003051101]
This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1.
We propose a model-free algorithm that converts the PCMDP problem into an unconstrained problem and applies a Q-learning based approach.
arXiv Detail & Related papers (2020-03-11T23:23:29Z)
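As referenced in the double Q-learning entry above, the update being analyzed maintains two Q-tables and uses one to choose the greedy next action while the other evaluates it, which removes the overestimation bias of the single max operator in standard Q-learning. The sketch below is a minimal, generic tabular implementation of that standard update (the step size `alpha`, the behavior policy, and the random generator are left to the caller); it illustrates the algorithm under analysis, not the paper's finite-time analysis itself.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, gamma, alpha, rng):
    """One tabular double Q-learning step.

    With probability 1/2, QA is updated using QB to evaluate the action that is
    greedy with respect to QA (and vice versa), decoupling action selection
    from action evaluation.
    """
    if rng.random() < 0.5:
        a_star = int(np.argmax(QA[s_next]))       # select greedy action with QA
        target = r + gamma * QB[s_next, a_star]   # evaluate it with QB
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = int(np.argmax(QB[s_next]))       # select greedy action with QB
        target = r + gamma * QA[s_next, b_star]   # evaluate it with QA
        QB[s, a] += alpha * (target - QB[s, a])
```

At decision time the agent typically acts greedily (or epsilon-greedily) with respect to QA + QB, and the asynchronous variant applies this update only along the visited trajectory.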
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.