State Augmented Constrained Reinforcement Learning: Overcoming the
Limitations of Learning with Rewards
- URL: http://arxiv.org/abs/2102.11941v2
- Date: Thu, 21 Sep 2023 14:36:25 GMT
- Title: State Augmented Constrained Reinforcement Learning: Overcoming the
Limitations of Learning with Rewards
- Authors: Miguel Calvo-Fullana, Santiago Paternain, Luiz F. O. Chamon, Alejandro
Ribeiro
- Abstract summary: A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods as the dynamics that drive the multipliers' evolution.
- Score: 88.30521204048551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common formulation of constrained reinforcement learning involves multiple
rewards that must individually accumulate to given thresholds. In this class of
problems, we show a simple example in which the desired optimal policy cannot
be induced by any weighted linear combination of rewards. Hence, there exist
constrained reinforcement learning problems for which neither regularized nor
classical primal-dual methods yield optimal policies. This work addresses this
shortcoming by augmenting the state with Lagrange multipliers and
reinterpreting primal-dual methods as the portion of the dynamics that drives
the multipliers' evolution. This approach provides a systematic state
augmentation procedure that is guaranteed to solve reinforcement learning
problems with constraints. Thus, as we illustrate by an example, while previous
methods can fail at finding optimal policies, running the dual dynamics while
executing the augmented policy yields an algorithm that provably samples
actions from the optimal policy.
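The mechanism described in the abstract lends itself to a short sketch. Below is a minimal, illustrative Python/NumPy rendering of the idea, assuming a hypothetical environment API (`env.reset`, `env.step` returning an objective reward followed by constraint rewards), a multiplier-conditioned `policy(obs, lam)`, a dual step size `eta`, and a per-epoch horizon `T_epoch`; none of these names come from the paper. The Lagrange multipliers are treated as part of the state, the augmented policy is executed, and a dual-ascent step on the constraint violations drives the multipliers' evolution.

```python
import numpy as np

def run_state_augmented_policy(env, policy, num_constraints,
                               thresholds=None, eta=0.05,
                               T_epoch=100, num_epochs=50):
    """Execute a multiplier-conditioned policy while running the dual dynamics.

    Assumed (hypothetical) interfaces: `policy(obs, lam)` returns an action for
    the augmented state (obs, lam); `env.step(action)` returns
    (obs, rewards, done, info) with rewards[0] the objective reward and
    rewards[1:] the constraint rewards.
    """
    thresholds = np.zeros(num_constraints) if thresholds is None else np.asarray(thresholds)
    lam = np.zeros(num_constraints)   # Lagrange multipliers, appended to the state
    obs = env.reset()

    for _ in range(num_epochs):
        constraint_returns = np.zeros(num_constraints)
        for _ in range(T_epoch):
            action = policy(obs, lam)              # policy sees (obs, lam)
            obs, rewards, done, _ = env.step(action)
            constraint_returns += np.asarray(rewards[1:])
            if done:
                obs = env.reset()
        # Dual dynamics: a multiplier grows while its constraint is violated
        # (average constraint reward below threshold) and shrinks otherwise.
        violation = thresholds - constraint_returns / T_epoch
        lam = np.maximum(lam + eta * violation, 0.0)
    return lam
```

The point of the augmentation is that the executed policy can change behavior as the multipliers move, which is what allows it to realize optimal policies that no fixed weighted combination of the rewards can induce.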
Related papers
- Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets [6.5472155063246085]
Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered.
We propose Adversarial Constrained Policy Optimization (ACPO), which enables the simultaneous optimization of the reward and the adaptation of cost budgets during training.
arXiv Detail & Related papers (2024-10-28T07:04:32Z)
- A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints [0.0]
We use a generic primal-dual framework for value-based and actor-critic reinforcement learning methods.
The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy.
A practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training.
arXiv Detail & Related papers (2024-04-25T09:50:57Z)
- Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics [5.270497591225775]
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward while satisfying constraints on secondary cumulative rewards.
Several algorithms rooted in sample-based primal-dual methods have recently been proposed to solve this problem in policy space.
We propose a novel algorithm for constrained RL that does not suffer from these limitations.
arXiv Detail & Related papers (2022-12-03T01:54:55Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of the discrepancy between behavior and target policies in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple yet effective penalty function to eliminate the cost constraints and removes the trust-region constraint via the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
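To make the penalty idea concrete, here is a hedged PyTorch sketch in the spirit of P3O, not its exact published loss: a PPO-style clipped surrogate on the reward advantage is combined with a ReLU penalty on the estimated cost surplus over the budget, so the constrained step reduces to a single unconstrained minimization. All tensor names and the penalty weight `kappa` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def penalized_clipped_loss(logp_new, logp_old, adv_reward, adv_cost,
                           cost_estimate, cost_budget,
                           kappa=10.0, clip_eps=0.2):
    """Penalty-based surrogate in the spirit of P3O (illustrative, not the exact paper loss)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Clipped surrogate for the reward objective (to be maximized).
    reward_surrogate = torch.min(ratio * adv_reward, clipped * adv_reward).mean()

    # First-order estimate of the new policy's expected cost around the old one.
    cost_surrogate = cost_estimate + (ratio * adv_cost).mean()

    # Penalty is active only when the predicted cost exceeds the budget.
    penalty = kappa * F.relu(cost_surrogate - cost_budget)

    return -(reward_surrogate - penalty)   # a loss suitable for gradient descent
```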
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average-reward setting with variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
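The variance constraint referenced here is commonly made tractable through Fenchel duality. One equivalent way to state the underlying identity (a standard fact; the paper's exact formulation may differ) is sketched below, where the Fenchel dual variable y ends up tracking the mean return and can therefore be updated alongside the policy and the Lagrange multiplier.

```latex
% Variance handled via Fenchel duality (illustrative):
% using x^2 = \max_y \, (2xy - y^2) applied to (\mathbb{E}[R])^2,
\operatorname{Var}[R]
  = \mathbb{E}[R^2] - \big(\mathbb{E}[R]\big)^2
  = \min_{y \in \mathbb{R}} \mathbb{E}\big[(R - y)^2\big],
\qquad y^\star = \mathbb{E}[R].
```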
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning [14.325835899564664]
An entropy-regularized value-based reinforcement learning method can ensure the monotonic improvement of the policy at each update.
We propose a novel reinforcement learning algorithm that exploits this monotonic-improvement lower bound as a criterion for adjusting the degree of each policy update, alleviating policy oscillation.
arXiv Detail & Related papers (2020-08-25T04:09:18Z)
- Novel Policy Seeking with Constrained Optimization [131.67409598529287]
We propose to rethink the problem of generating novel policies in reinforcement learning tasks.
We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods.
The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature.
arXiv Detail & Related papers (2020-05-21T14:39:14Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
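As a rough illustration of the reward-conditioned idea in the entry above (supervised learning on the agent's own experience, with actions conditioned on the return the trajectory actually achieved in place of expert demonstrations), here is a hedged PyTorch sketch; the architecture, discrete-action assumption, and relabeling interface are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RewardConditionedPolicy(nn.Module):
    """Policy conditioned on a target return, trained by supervised learning.

    Illustrative only: a small MLP that takes (state, target_return) and
    outputs action logits; architecture and sizes are arbitrary choices.
    """
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, target_return):
        x = torch.cat([state, target_return.unsqueeze(-1)], dim=-1)
        return self.net(x)

def supervised_update(policy, optimizer, states, actions, achieved_returns):
    """One supervised step: imitate past actions, conditioned on the return
    that the trajectory actually achieved (relabeling replaces expert data)."""
    logits = policy(states, achieved_returns)
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At execution time one would typically condition on a high target return to ask the trained policy for better-than-average behavior.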