Exploration-Exploitation in Constrained MDPs
- URL: http://arxiv.org/abs/2003.02189v1
- Date: Wed, 4 Mar 2020 17:03:56 GMT
- Title: Exploration-Exploitation in Constrained MDPs
- Authors: Yonathan Efroni and Shie Mannor and Matteo Pirotta
- Abstract summary: We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs)
While learning in an unknown CMDP, an agent should trade off exploration to discover new information about the MDP against exploitation of the current knowledge to maximize the reward while satisfying the constraints.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
- Score: 79.23623305214275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many sequential decision-making problems, the goal is to optimize a
utility function while satisfying a set of constraints on different utilities.
This learning problem is formalized through Constrained Markov Decision
Processes (CMDPs). In this paper, we investigate the exploration-exploitation
dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade off
exploration to discover new information about the MDP, and exploitation of the
current knowledge to maximize the reward while satisfying the constraints.
While the agent will eventually learn a good or optimal policy, we do not want
the agent to violate the constraints too often during the learning process. In
this work, we analyze two approaches for learning in CMDPs. The first approach
leverages the linear formulation of CMDP to perform optimistic planning at each
episode. The second approach leverages the dual formulation (or saddle-point
formulation) of CMDP to perform incremental, optimistic updates of the primal
and dual variables. We show that both achieve sublinear regret w.r.t. the main
utility while having sublinear regret on the constraint violations. That being
said, we highlight a crucial difference between the two approaches: the linear
programming approach results in stronger guarantees than the approach based on
the dual formulation.
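For readers who want the two formulations the abstract refers to side by side, here is a compact restatement. The notation is mine (not copied from the paper) and is simplified to a single constraint with threshold b and initial state s_1.

```latex
% CMDP objective (single constraint, episodic setting):
\[
  \max_{\pi}\; V_r^{\pi}(s_1) \quad \text{s.t.} \quad V_c^{\pi}(s_1) \ge b .
\]
% Linear-programming view over occupancy measures q: both the objective and the
% constraint are linear in q, where \Delta denotes the occupancy measures
% consistent with the transition dynamics (the flow constraints):
\[
  \max_{q \in \Delta}\; \sum_{h,s,a} q_h(s,a)\, r_h(s,a)
  \quad \text{s.t.} \quad \sum_{h,s,a} q_h(s,a)\, c_h(s,a) \ge b .
\]
% Saddle-point (Lagrangian) view exploited by primal-dual methods:
\[
  \max_{\pi}\, \min_{\lambda \ge 0}\; V_r^{\pi}(s_1) + \lambda\bigl(V_c^{\pi}(s_1) - b\bigr).
\]
```

And a minimal Python sketch of the generic incremental primal-dual loop that the second approach builds on. This is not the paper's algorithm (in particular, the optimistic model estimation is omitted), and the environment/planner interfaces are assumptions of mine.

```python
import numpy as np

def primal_dual_cmdp(env, planner, num_episodes, horizon, b, lr=0.1, lam_max=10.0):
    """Generic primal-dual loop for a CMDP with constraint V_c >= b.

    Assumed (hypothetical) interfaces, for illustration only:
      planner(lam) -> policy, a callable policy(state, step) computed for the
                      scalarized reward r + lam * c;
      env.step(a)  -> (next_state, reward, constraint_utility, done).
    """
    lam = 0.0  # dual variable (Lagrange multiplier) for the constraint
    for _ in range(num_episodes):
        policy = planner(lam)              # primal step: plan for r + lam * c
        total_c, s = 0.0, env.reset()
        for h in range(horizon):           # roll out and measure the constraint
            a = policy(s, h)
            s, r, c, done = env.step(a)
            total_c += c
            if done:
                break
        # Dual step: projected subgradient ascent; lam grows when V_c < b.
        lam = float(np.clip(lam + lr * (b - total_c), 0.0, lam_max))
    return lam
```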
Related papers
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU are naturally modeled as Multistage Problems (MSPs) but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach Two-Stage General Decision Rules (TS-GDR) to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Reward-Punishment Reinforcement Learning with Maximum Entropy [3.123049150077741]
We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives.
Our motivation is to facilitate a smoother variation of operators utilized in the updating of action values beyond traditional "max" and "min" operators.
arXiv Detail & Related papers (2024-05-20T05:05:14Z) - Learning Adversarial MDPs with Stochastic Hard Constraints [37.24692425018]
We study online learning problems in constrained decision processes with adversarial losses and hard constraints.
We design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability.
arXiv Detail & Related papers (2024-03-06T12:49:08Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Model-based Constrained MDP for Budget Allocation in Sequential Incentive Marketing [28.395877073390434]
Sequential incentive marketing is an important approach for online businesses to acquire customers, increase loyalty and boost sales.
How to effectively allocate the incentives so as to maximize the return under the budget constraint is less studied in the literature.
We propose an efficient learning algorithm which combines bisection search and model-based planning.
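As a hedged illustration of how bisection search can enforce a budget constraint in such a setting (a generic sketch under my own assumptions, not the paper's algorithm): if the model-based planner exposes the expected spend of the policy obtained under a cost penalty lam, and that spend is non-increasing in lam, bisection finds the smallest penalty that keeps the policy within budget.

```python
def bisect_budget_multiplier(expected_spend, budget, lam_lo=0.0, lam_hi=1.0, tol=1e-4):
    """Find the smallest cost penalty lam whose planned policy respects the budget.

    `expected_spend(lam)` is a hypothetical hook into a model-based planner that
    returns the expected incentive cost of the policy planned with penalty lam;
    it is assumed to be non-increasing in lam.
    """
    while expected_spend(lam_hi) > budget:   # grow the bracket until feasible
        lam_hi *= 2.0
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if expected_spend(mid) > budget:     # over budget: penalize cost more
            lam_lo = mid
        else:                                # within budget: try a smaller penalty
            lam_hi = mid
    return lam_hi
```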
arXiv Detail & Related papers (2023-03-02T08:10:45Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
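A minimal sketch of the early-termination idea as described in the summary (the wrapper interface and names are mine, and the context-model-based off-policy learner is not shown): episodes end as soon as the accumulated constraint cost exceeds its limit, so a standard unconstrained RL algorithm can be trained on the wrapped environment.

```python
class EarlyTerminatedEnv:
    """Gym-style wrapper (assumed interface) that ends an episode on constraint violation."""

    def __init__(self, env, cost_limit):
        self.env = env
        self.cost_limit = cost_limit
        self.total_cost = 0.0

    def reset(self):
        self.total_cost = 0.0
        return self.env.reset()

    def step(self, action):
        # Assumes the wrapped env reports a per-step constraint cost in `info`.
        obs, reward, done, info = self.env.step(action)
        self.total_cost += info.get("cost", 0.0)
        if self.total_cost > self.cost_limit:
            done = True  # terminate early instead of letting violations accumulate
        return obs, reward, done, info
```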
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)