Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
- URL: http://arxiv.org/abs/2510.11491v1
- Date: Mon, 13 Oct 2025 14:59:28 GMT
- Title: Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
- Authors: Murad Dawood, Usama Ahmed Siddiquie, Shahram Khorshidi, Maren Bennewitz
- Abstract summary: We propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations. The regulator is trained to minimize constraint violations while avoiding suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to a factor of 126 while increasing returns by over an order of magnitude compared to prior methods.
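The mechanism lends itself to a compact sketch. The PyTorch snippet below illustrates one plausible reading of the abstract; the network shape, its inputs, and the sigmoid-bounded scale are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CostAwareRegulator(nn.Module):
    """Maps a state-action pair to a smooth scale factor in (0, 1)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps the scale strictly between 0 and 1: the action is
        # attenuated, never replaced, so exploration is preserved.
        return torch.sigmoid(self.net(torch.cat([state, action], dim=-1)))

def regulated_action(actor, regulator, state):
    """Scale the base policy's action instead of overriding it."""
    action = actor(state)  # e.g. an SAC or TD3 actor
    return regulator(state, action) * action
```

A training loss for the regulator would then trade predicted cost against a regularizer keeping the scale near one, matching the abstract's requirement of avoiding degenerate suppression; the paper's exact loss is not reproduced here.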
Related papers
- Safe Reinforcement Learning via Recovery-based Shielding with Gaussian Process Dynamics Models [57.006252510102506]
Reinforcement learning (RL) is a powerful framework for optimal decision-making and control but often lacks provable guarantees for safety-critical applications. We introduce a novel recovery-based shielding framework that enables safe RL with a provable safety lower bound for unknown and non-linear continuous dynamical systems.
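As a rough illustration of the shielding pattern this summary describes, the placeholder below falls back to a recovery policy whenever a safety check rejects the task policy's action; `is_safe` and `recovery_policy` are hypothetical stand-ins, not the paper's GP-based construction:

```python
# Hypothetical shielding wrapper: in the paper the safety check is derived
# from Gaussian-process dynamics models, but here it is a placeholder.
def shielded_action(state, task_policy, recovery_policy, is_safe):
    action = task_policy(state)
    if not is_safe(state, action):
        # Recovery policy steers the system back toward a provably safe set.
        action = recovery_policy(state)
    return action
```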
arXiv Detail & Related papers (2026-02-12T22:03:35Z)
- Boundary-to-Region Supervision for Offline Safe Reinforcement Learning [56.150983204962735]
Boundary-to-Region (B2R) is a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines the cost-to-go (CTG) as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks.
arXiv Detail & Related papers (2025-09-30T03:38:20Z)
- Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning [9.62939764063531]
Constrained Reinforcement Learning aims to maximize the return while adhering to predefined constraint limits. In continuous control settings, balancing the trade-off between reward and constraint satisfaction remains a significant challenge. We introduce a novel approach that integrates an adaptive incentive mechanism in addition to the reward structure to stay within the constraint bound.
arXiv Detail & Related papers (2025-09-11T07:33:35Z)
- Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints [52.37099916582462]
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints.
We propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN).
PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration.
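A hedged sketch of this exterior-penalty idea follows, with a small network standing in for the PMN; the architecture and loss form are illustrative assumptions, not the published design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PenaltyMetricNetwork(nn.Module):
    """Maps the magnitude of a constraint violation to a non-negative penalty."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, violation: torch.Tensor) -> torch.Tensor:
        return F.softplus(self.net(violation))  # penalty is always >= 0

def penalized_objective(reward_objective, expected_cost, cost_limit, pmn):
    # Exterior penalty: active only outside the feasible region (cost > limit),
    # so feasible policies are optimized on reward alone.
    violation = torch.clamp(expected_cost - cost_limit, min=0.0).reshape(-1, 1)
    return reward_objective - pmn(violation).mean()
```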
arXiv Detail & Related papers (2024-07-22T10:57:32Z)
- Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning [5.862025534776996]
Reinforcement Learning for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment.
In such methods, if agents are in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized.
We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy.
In a philosophical sense this formulation only penalizes the learner for constraint violations that it caused; in a practical sense it maintains feasibility of the optimal control problem.
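Read schematically, the counterfactual constraint penalizes only the cost the learned policy incurs beyond what the safe default policy would incur; the snippet below is a simplified reading, not the paper's exact formulation:

```python
def counterfactual_harm(cost_learned: float, cost_default: float) -> float:
    # Only extra cost relative to the safe default policy counts as harm;
    # violations the default would also incur are not penalized, which keeps
    # the constrained problem feasible even in unavoidably unsafe states.
    return max(0.0, cost_learned - cost_default)
```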
arXiv Detail & Related papers (2024-05-19T20:33:21Z)
- Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted constrained Markov decision process (CMDP) model constrains risk in expectation, which leaves room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
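One plausible form of the suppression mechanism gates the reward objective by the safety critic's risk estimate; the sigmoid gate, threshold, and sharpness below are assumptions for illustration:

```python
import torch

def suppressed_objective(reward_objective: torch.Tensor,
                         risk: torch.Tensor,
                         threshold: float = 0.0,
                         sharpness: float = 5.0) -> torch.Tensor:
    # The further the safety critic's risk estimate rises above the threshold,
    # the more the task-reward-maximizing objective is suppressed.
    gate = torch.sigmoid(-sharpness * (risk - threshold))
    return gate * reward_objective
```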
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
- Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning [20.280636126917614]
Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories.
We propose a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent with a low-level goal conditioned RL agent.
A major advantage of CoSHRL is that it can handle constraints on the cost value distribution and can adjust to flexible constraint thresholds without retraining.
arXiv Detail & Related papers (2023-02-21T12:57:12Z)
- Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
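The summary's two ingredients, an exact penalty replacing the cost constraint and a PPO-style clipped reward surrogate, combine into a single unconstrained loss; the sketch below is one plausible rendering, with the penalty coefficient and surrogate forms as assumptions:

```python
import torch

def p3o_style_loss(ratio, adv_r, adv_c, old_policy_cost, cost_limit,
                   kappa: float = 20.0, clip_eps: float = 0.2):
    # Clipped reward surrogate, as in PPO.
    reward_surr = torch.min(
        ratio * adv_r,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_r,
    ).mean()
    # Cost surrogate; adding the old policy's cost estimates the constraint gap.
    cost_surr = (ratio * adv_c).mean() + old_policy_cost - cost_limit
    # Exact (ReLU) penalty eliminates the constraint: it is active only when
    # the estimated cost exceeds the limit.
    return -(reward_surr - kappa * torch.relu(cost_surr))
```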
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Lyapunov Barrier Policy Optimization [15.364174084072872]
We propose a new method, LBPO, that uses a Lyapunov-based barrier function to restrict the policy update to a safe set for each training iteration.
Our method also allows the user to control the conservativeness of the agent with respect to the constraints in the environment.
arXiv Detail & Related papers (2021-03-16T17:58:27Z)
- Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z)
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in optimal yet physically feasible robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)