Exploration-Exploitation in Constrained MDPs
- URL: http://arxiv.org/abs/2003.02189v1
- Date: Wed, 4 Mar 2020 17:03:56 GMT
- Title: Exploration-Exploitation in Constrained MDPs
- Authors: Yonathan Efroni and Shie Mannor and Matteo Pirotta
- Abstract summary: We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs)
While learning in an unknown CMDP, an agent should trade off exploration to discover new information about the MDP against exploitation of its current knowledge to maximize the reward while satisfying the constraints.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
- Score: 79.23623305214275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many sequential decision-making problems, the goal is to optimize a
utility function while satisfying a set of constraints on different utilities.
This learning problem is formalized through Constrained Markov Decision
Processes (CMDPs). In this paper, we investigate the exploration-exploitation
dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade-off
exploration to discover new information about the MDP, and exploitation of the
current knowledge to maximize the reward while satisfying the constraints.
While the agent will eventually learn a good or optimal policy, we do not want
the agent to violate the constraints too often during the learning process. In
this work, we analyze two approaches for learning in CMDPs. The first approach
leverages the linear formulation of CMDP to perform optimistic planning at each
episode. The second approach leverages the dual formulation (or saddle-point
formulation) of CMDP to perform incremental, optimistic updates of the primal
and dual variables. We show that both achieve sublinear regret w.r.t. the
main utility while having sublinear regret on the constraint violations. That
being said, we highlight a crucial difference between the two approaches: the
linear programming approach results in stronger guarantees than the
dual-formulation-based approach.
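To make the second route concrete, the saddle-point view of a CMDP is max_pi min_{lambda >= 0} V_r(pi) - lambda * (V_c(pi) - c_bar), where V_r and V_c are the expected reward and cost and c_bar is the constraint threshold. Below is a minimal sketch of incremental, optimistic primal-dual updates on a toy two-action constrained problem; the setting, step sizes, and exploration bonus are illustrative assumptions, not the algorithms analyzed in this paper.

```python
import numpy as np

# Toy sketch of the saddle-point (Lagrangian) view of a constrained problem:
#   max_pi  E[reward(pi)]   s.t.   E[cost(pi)] <= cost_threshold
# handled by incremental primal-dual updates with an optimistic bonus.
# All quantities below are illustrative assumptions, not the paper's setup.

np.random.seed(0)

n_actions = 2
true_reward = np.array([1.0, 0.3])   # action 0 is rewarding but costly
true_cost = np.array([0.8, 0.1])     # action 1 is safe but low-reward
cost_threshold = 0.3

lam = 0.0          # dual variable (Lagrange multiplier)
lr_dual = 0.05     # dual step size
q_reward = np.zeros(n_actions)
q_cost = np.zeros(n_actions)
counts = np.zeros(n_actions)

for t in range(1, 2001):
    # Optimism: add an exploration bonus to the Lagrangian value of each action.
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
    lagrangian_value = q_reward - lam * q_cost + bonus
    a = int(np.argmax(lagrangian_value))

    # Observe noisy reward and cost, update empirical estimates (primal side).
    r = true_reward[a] + 0.1 * np.random.randn()
    c = true_cost[a] + 0.1 * np.random.randn()
    counts[a] += 1
    q_reward[a] += (r - q_reward[a]) / counts[a]
    q_cost[a] += (c - q_cost[a]) / counts[a]

    # Dual ascent: raise lambda when the constraint is violated, lower otherwise.
    lam = max(0.0, lam + lr_dual * (c - cost_threshold))

print("empirical reward estimates:", q_reward.round(2))
print("empirical cost estimates:  ", q_cost.round(2))
print("final dual variable lambda:", round(lam, 2))
```

The dual variable grows while the played action's empirical cost exceeds the threshold, which pushes the optimistic argmax toward constraint-satisfying actions; the linear programming approach instead solves an optimistic occupancy-measure LP at the start of each episode.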
Related papers
- Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation [24.299769025346368]
We study the reinforcement learning problem in a constrained Markov decision process (CMDP).
We propose an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee.
Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
arXiv Detail & Related papers (2025-02-14T13:07:25Z) - Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality [55.06411438416805]
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications.
This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies.
It is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Reward-Punishment Reinforcement Learning with Maximum Entropy [3.123049150077741]
We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives.
Our motivation is to facilitate a smoother variation of the operators used to update action values, beyond the traditional "max" and "min" operators.
arXiv Detail & Related papers (2024-05-20T05:05:14Z) - Learning Adversarial MDPs with Stochastic Hard Constraints [37.24692425018]
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and hard constraints.
Our work is the first to study CMDPs involving both adversarial losses and hard constraints.
arXiv Detail & Related papers (2024-03-06T12:49:08Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Model-based Constrained MDP for Budget Allocation in Sequential
Incentive Marketing [28.395877073390434]
Sequential incentive marketing is an important approach for online businesses to acquire customers, increase loyalty and boost sales.
How to effectively allocate the incentives so as to maximize the return under the budget constraint is less studied in the literature.
We propose an efficient learning algorithm which combines bisection search and model-based planning.
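As a rough illustration of the bisection-plus-planning idea (under assumed, toy quantities, not the paper's algorithm): for a fixed Lagrange multiplier lambda, the budget-constrained problem reduces to planning with reward r - lambda * cost, and bisection searches for the multiplier whose plan just meets the budget.

```python
import numpy as np

# Hedged sketch: bisection over a Lagrange multiplier combined with a planning
# step. The one-step "planning" and all numbers below are illustrative
# assumptions, not the paper's method.

rewards = np.array([5.0, 3.0, 1.0])   # expected uplift of each incentive level
costs = np.array([4.0, 2.0, 0.5])     # expected spend of each incentive level
budget = 1.0                           # average spend allowed per customer

def plan(lam):
    """Greedy planning step: pick the incentive maximizing reward - lam * cost."""
    a = int(np.argmax(rewards - lam * costs))
    return a, costs[a]

lo, hi = 0.0, 10.0                     # hi is chosen large enough to be feasible
for _ in range(50):                    # bisection on the multiplier
    lam = 0.5 * (lo + hi)
    _, spend = plan(lam)
    if spend > budget:
        lo = lam                       # over budget: penalize cost more
    else:
        hi = lam                       # within budget: try penalizing less

a, spend = plan(hi)
print(f"chosen incentive {a}, expected spend {spend}, multiplier ~{hi:.2f}")
```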
arXiv Detail & Related papers (2023-03-02T08:10:45Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential
Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)