Related papers: Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints

Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints

URL: http://arxiv.org/abs/2003.05555v6
Date: Mon, 13 Jun 2022 22:02:32 GMT
Title: Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints
Authors: Qinbo Bai and Vaneet Aggarwal and Ather Gattami
Abstract summary: This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts PCMDP problem to an unconstrained problem and a Q-learning based approach is applied.
Score: 38.2783003051101
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the optimization of dynamic systems, the variables typically have constraints. Such problems can be modeled as a Constrained Markov Decision Process (CMDP). This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1. We propose a model-free algorithm that converts PCMDP problem to an unconstrained problem and a Q-learning based approach is applied. We define the concept of probably approximately correct (PAC) to the proposed PCMDP problem. The proposed algorithm is proved to achieve an $(\epsilon,p)$-PAC policy when the episode $K\geq\Omega(\frac{I^2H^6SA\ell}{\epsilon^2})$, where $S$ and $A$ are the number of states and actions, respectively. $H$ is the number of epochs per episode. $I$ is the number of constraint functions, and $\ell=\log(\frac{SAT}{p})$. We note that this is the first result on PAC kind of analysis for PCMDP with peak constraints, where the transition dynamics are not known apriori. We demonstrate the proposed algorithm on an energy harvesting problem and a single machine scheduling problem, where it performs close to the theoretical upper bound of the studied optimization problem.

Related papers

Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs) We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z)
Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes. We propose an algorithm based on Thompson sampling with dynamically-sized episodes. We show that our algorithm can be applied to develop approximately optimal control algorithms.
arXiv Detail & Related papers (2023-06-05T03:57:16Z)
Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP) We prove an upper bound of $widetilde O(H varepsilon-3 ln frac1delta)$ samples per state-action pair, where $H := sp(h*)$ is the span of bias of any optimal policy, $varepsilon$ is the accuracy and $delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z)
Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs) Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best identification in discounted linear Markov+Delta Decision in the fixed confidence setting under a generative model. The lower bound as the solution of an intricate non- optimization program can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
Efficient Algorithms for Planning with Participation Constraints [74.74967476995572]
We consider the problem of planning with participation constraints introduced in [Zhang et al., 2022] In this problem, a principal chooses actions in a decision process, resulting in separate utilities for the principal and the agent. We provide the first-time exact algorithm for this problem for finite-horizon settings, where previously only an additive $varepsilon$-approximation was known.
arXiv Detail & Related papers (2022-05-16T15:47:41Z)
Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs [24.256960622176305]
We propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in episodic Markov decision processes. Our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts.
arXiv Detail & Related papers (2022-03-17T11:19:41Z)
Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach [37.80609997145897]
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. Various algorithms are available to solve CMDP problems in a model-free manner to achieve $epsilon$-optimal cumulative reward with $epsilon$ feasible policies. An important question here is whether we can achieve $epsilon$-optimal cumulative reward with zero constraint violations or not.
arXiv Detail & Related papers (2021-09-13T21:27:03Z)
A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints [8.849815837266977]
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning.
arXiv Detail & Related papers (2020-09-23T19:30:46Z)
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints [38.2783003051101]
This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. We note that these are the first results on regret analysis for MDP with long-term constraints, where the transition probabilities are not known apriori.
arXiv Detail & Related papers (2020-06-10T17:19:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.