Model-Free Algorithm and Regret Analysis for MDPs with Long-Term
Constraints
- URL: http://arxiv.org/abs/2006.05961v2
- Date: Sat, 30 Jan 2021 21:06:30 GMT
- Title: Model-Free Algorithm and Regret Analysis for MDPs with Long-Term
Constraints
- Authors: Qinbo Bai and Vaneet Aggarwal and Ather Gattami
- Abstract summary: This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints.
We note that these are the first results on regret analysis for MDPs with long-term constraints, where the transition probabilities are not known a priori.
- Score: 38.2783003051101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the optimization of dynamical systems, the variables typically have
constraints. Such problems can be modeled as a constrained Markov Decision
Process (CMDP). This paper considers a model-free approach to the problem,
where the transition probabilities are not known. In the presence of long-term
(or average) constraints, the agent has to choose a policy that maximizes the
long-term average reward while satisfying the average constraints in each
episode. The key challenge with the long-term constraints is that the optimal
policy is not deterministic in general, and thus standard Q-learning approaches
cannot be directly used. This paper uses concepts from constrained optimization
and Q-learning to propose an algorithm for CMDP with long-term constraints. For
any $\gamma\in(0,\frac{1}{2})$, the proposed algorithm is shown to achieve
$O(T^{1/2+\gamma})$ regret bound for the obtained reward and
$O(T^{1-\gamma/2})$ regret bound for the constraint violation, where $T$ is the
total number of steps. We note that these are the first results on regret
analysis for MDPs with long-term constraints, where the transition probabilities
are not known a priori.
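The sketch below illustrates the general Lagrangian-relaxation idea behind many model-free approaches to CMDPs with long-term constraints: Q-learning is run on the relaxed reward $r(s,a) - \lambda\, c(s,a)$ while the multiplier $\lambda$ is adjusted by dual ascent on the observed constraint violation. This is not the paper's algorithm or its regret-optimal parameter schedule; the toy MDP, the budget `c_max`, the discount factor, and the step sizes are assumptions made purely for illustration.

```python
# A minimal sketch (not the paper's algorithm) of Lagrangian-relaxed Q-learning
# for a CMDP with one long-term (average) cost constraint. The toy MDP, the
# budget c_max, the discount factor, and the step sizes are illustrative
# assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # reward r(s, a)
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # constraint cost c(s, a)
c_max = 0.4                                                       # long-term average-cost budget (assumed)

gamma_discount = 0.95   # discount used inside the Q-update (illustrative)
Q = np.zeros((n_states, n_actions))
lam = 0.0               # Lagrange multiplier for the constraint
T = 50_000

s = 0
avg_cost = 0.0
for t in range(1, T + 1):
    # Epsilon-greedy action on the Lagrangian Q-values.
    eps = max(0.05, 1.0 / np.sqrt(t))
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))

    s_next = rng.choice(n_states, p=P[s, a])

    # Q-learning update on the Lagrangian-relaxed reward r(s, a) - lam * c(s, a).
    alpha = 1.0 / np.sqrt(t)
    target = (R[s, a] - lam * C[s, a]) + gamma_discount * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

    # Dual ascent: raise lam when the running average cost exceeds the budget.
    avg_cost += (C[s, a] - avg_cost) / t
    lam = max(0.0, lam + (1.0 / np.sqrt(t)) * (avg_cost - c_max))

    s = s_next

print(f"multiplier {lam:.3f}, running average cost {avg_cost:.3f} (budget {c_max})")
```

As the abstract notes, the optimal policy for a CMDP with long-term constraints is in general not deterministic, so a plain greedy Q-learning policy alone is insufficient; the sketch only conveys the coupling between the primal Q-update and the dual multiplier update, not the schedule that yields the stated $O(T^{1/2+\gamma})$ and $O(T^{1-\gamma/2})$ bounds.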
Related papers
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach Two-Stage General Decision Rules (TS-GDR) to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks, named Two-Stage Deep Decision Rules (TS-DDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Anytime-Constrained Reinforcement Learning [6.981971551979697]
We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints.
We show that there exist optimal deterministic policies augmented with cumulative costs.
We show that computing non-trivial approximately optimal policies is NP-hard in general.
arXiv Detail & Related papers (2023-11-09T16:51:26Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Near-Optimal Sample Complexity Bounds for Constrained MDPs [25.509556551558834]
We provide minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP.
We show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.
arXiv Detail & Related papers (2022-06-13T15:58:14Z) - Efficient Algorithms for Planning with Participation Constraints [74.74967476995572]
We consider the problem of planning with participation constraints introduced in [Zhang et al., 2022].
In this problem, a principal chooses actions in a decision process, resulting in separate utilities for the principal and the agent.
We provide the first exact algorithm for this problem in finite-horizon settings, where previously only an additive $\varepsilon$-approximation was known.
arXiv Detail & Related papers (2022-05-16T15:47:41Z) - Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach [37.80609997145897]
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment.
The problem becomes more challenging when the decision requirement includes satisfying some safety constraints.
Various algorithms are available to solve CMDP problems in a model-free manner and achieve an $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An important question here is whether we can achieve an $\epsilon$-optimal cumulative reward with zero constraint violation.
arXiv Detail & Related papers (2021-09-13T21:27:03Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints [38.2783003051101]
This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses a policy that maximizes the total reward over a finite horizon while satisfying constraints at each epoch with probability 1.
We propose a model-free algorithm that converts the PCMDP problem into an unconstrained problem, to which a Q-learning based approach is applied.
arXiv Detail & Related papers (2020-03-11T23:23:29Z)