Exploration-Exploitation in Constrained MDPs
- URL: http://arxiv.org/abs/2003.02189v1
- Date: Wed, 4 Mar 2020 17:03:56 GMT
- Title: Exploration-Exploitation in Constrained MDPs
- Authors: Yonathan Efroni and Shie Mannor and Matteo Pirotta
- Abstract summary: We investigate the exploration-exploitation dilemma in Constrained Markov Decision Processes (CMDPs)
While learning in an unknown CMDP, an agent should trade off exploration to discover new information about the MDP against exploitation of its current knowledge to maximize the reward while satisfying the constraints.
While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process.
- Score: 79.23623305214275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many sequential decision-making problems, the goal is to optimize a
utility function while satisfying a set of constraints on different utilities.
This learning problem is formalized through Constrained Markov Decision
Processes (CMDPs). In this paper, we investigate the exploration-exploitation
dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade-off
exploration to discover new information about the MDP, and exploitation of the
current knowledge to maximize the reward while satisfying the constraints.
While the agent will eventually learn a good or optimal policy, we do not want
the agent to violate the constraints too often during the learning process. In
this work, we analyze two approaches for learning in CMDPs. The first approach
leverages the linear formulation of CMDP to perform optimistic planning at each
episode. The second approach leverages the dual formulation (or saddle-point
formulation) of CMDP to perform incremental, optimistic updates of the primal
and dual variables. We show that both achieve sublinear regret w.r.t. the
main utility while having sublinear regret on the constraint violations. That
being said, we highlight a crucial difference between the two approaches: the
linear programming approach results in stronger guarantees than the
dual-formulation-based approach.
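To make the second route concrete, the saddle-point view of a CMDP is max_pi min_{lambda >= 0} V_r(pi) - lambda * (V_c(pi) - c_bar), where V_r and V_c are the expected reward and cost and c_bar is the constraint threshold. Below is a minimal sketch of incremental, optimistic primal-dual updates on a toy two-action constrained problem; the setting, step sizes, and exploration bonus are illustrative assumptions, not the algorithms analyzed in this paper.

```python
import numpy as np

# Toy sketch of the saddle-point (Lagrangian) view of a constrained problem:
#   max_pi  E[reward(pi)]   s.t.   E[cost(pi)] <= cost_threshold
# handled by incremental primal-dual updates with an optimistic bonus.
# All quantities below are illustrative assumptions, not the paper's setup.

np.random.seed(0)

n_actions = 2
true_reward = np.array([1.0, 0.3])   # action 0 is rewarding but costly
true_cost = np.array([0.8, 0.1])     # action 1 is safe but low-reward
cost_threshold = 0.3

lam = 0.0          # dual variable (Lagrange multiplier)
lr_dual = 0.05     # dual step size
q_reward = np.zeros(n_actions)
q_cost = np.zeros(n_actions)
counts = np.zeros(n_actions)

for t in range(1, 2001):
    # Optimism: add an exploration bonus to the Lagrangian value of each action.
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
    lagrangian_value = q_reward - lam * q_cost + bonus
    a = int(np.argmax(lagrangian_value))

    # Observe noisy reward and cost, update empirical estimates (primal side).
    r = true_reward[a] + 0.1 * np.random.randn()
    c = true_cost[a] + 0.1 * np.random.randn()
    counts[a] += 1
    q_reward[a] += (r - q_reward[a]) / counts[a]
    q_cost[a] += (c - q_cost[a]) / counts[a]

    # Dual ascent: raise lambda when the constraint is violated, lower otherwise.
    lam = max(0.0, lam + lr_dual * (c - cost_threshold))

print("empirical reward estimates:", q_reward.round(2))
print("empirical cost estimates:  ", q_cost.round(2))
print("final dual variable lambda:", round(lam, 2))
```

The dual variable grows while the played action's empirical cost exceeds the threshold, which pushes the optimistic argmax toward constraint-satisfying actions; the linear programming approach instead solves an optimistic occupancy-measure LP at the start of each episode.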
Related papers
- Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation [24.299769025346368]
We study the reinforcement learning problem in a constrained Markov decision process (CMDP).
We propose an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee.
Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
arXiv Detail & Related papers (2025-02-14T13:07:25Z) - Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality [55.06411438416805]
Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications.
This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies.
It is shown to enhance solution quality and to reduce computation times by several orders of magnitude when compared to current state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T18:19:47Z) - Reward-Punishment Reinforcement Learning with Maximum Entropy [3.123049150077741]
We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives.
Our motivation is to facilitate a smoother variation of the operators used to update action values, beyond the traditional "max" and "min" operators.
arXiv Detail & Related papers (2024-05-20T05:05:14Z) - Learning Adversarial MDPs with Stochastic Hard Constraints [37.24692425018]
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and hard constraints.
Our work is the first to study CMDPs involving both adversarial losses and hard constraints.
arXiv Detail & Related papers (2024-03-06T12:49:08Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Model-based Constrained MDP for Budget Allocation in Sequential
Incentive Marketing [28.395877073390434]
Sequential incentive marketing is an important approach for online businesses to acquire customers, increase loyalty and boost sales.
How to effectively allocate the incentives so as to maximize the return under the budget constraint is less studied in the literature.
We propose an efficient learning algorithm which combines bisection search and model-based planning.
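As a rough illustration of the bisection-plus-planning idea (under assumed, toy quantities, not the paper's algorithm): for a fixed Lagrange multiplier lambda, the budget-constrained problem reduces to planning with reward r - lambda * cost, and bisection searches for the multiplier whose plan just meets the budget.

```python
import numpy as np

# Hedged sketch: bisection over a Lagrange multiplier combined with a planning
# step. The one-step "planning" and all numbers below are illustrative
# assumptions, not the paper's method.

rewards = np.array([5.0, 3.0, 1.0])   # expected uplift of each incentive level
costs = np.array([4.0, 2.0, 0.5])     # expected spend of each incentive level
budget = 1.0                           # average spend allowed per customer

def plan(lam):
    """Greedy planning step: pick the incentive maximizing reward - lam * cost."""
    a = int(np.argmax(rewards - lam * costs))
    return a, costs[a]

lo, hi = 0.0, 10.0                     # hi is chosen large enough to be feasible
for _ in range(50):                    # bisection on the multiplier
    lam = 0.5 * (lo + hi)
    _, spend = plan(lam)
    if spend > budget:
        lo = lam                       # over budget: penalize cost more
    else:
        hi = lam                       # within budget: try penalizing less

a, spend = plan(hi)
print(f"chosen incentive {a}, expected spend {spend}, multiplier ~{hi:.2f}")
```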
arXiv Detail & Related papers (2023-03-02T08:10:45Z) - Safe Exploration by Solving Early Terminated MDP [77.10563395197045]
We introduce a new approach to address safe RL problems under the framework of the Early Terminated MDP (ET-MDP).
We first define the ET-MDP as an unconstrained MDP with the same optimal value function as its corresponding CMDP.
An off-policy algorithm based on context models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP with better performance and improved learning efficiency.
arXiv Detail & Related papers (2021-07-09T04:24:40Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential
Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)