Constrained Group Relative Policy Optimization
- URL: http://arxiv.org/abs/2602.05863v2
- Date: Fri, 06 Feb 2026 18:22:59 GMT
- Title: Constrained Group Relative Policy Optimization
- Authors: Roger Girgis, Rodrigue de Schaetzen, Luke Rowe, Azalée Robitaille, Christopher Pal, Liam Paull
- Abstract summary: We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. We show that a naive multi-component treatment in advantage estimation can break constrained learning. We also evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success.
- Score: 18.3888203751956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
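A minimal sketch of the mechanism the abstract describes, assuming indicator costs and a group of G sampled rollouts per task: the Lagrangian return is scalarized before group normalization, whereas the naive per-component variant divides each term by its own standard deviation and thereby distorts the weight the multiplier is meant to impose. Function names and hyperparameters here (grpo_advantages_scalarized, eps, lr) are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def grpo_advantages_scalarized(rewards, costs, lam, eps=1e-8):
    """Group-relative advantages with a scalarized Lagrangian return.

    rewards: (G,) task rewards for a group of G rollouts of one task.
    costs:   (G,) indicator costs (1.0 iff the rollout violated a constraint).
    lam:     current Lagrange multiplier (lam >= 0).

    The return r - lam * c is scalarized *before* normalization, so a single
    shared mean/std is used and the reward/constraint trade-off set by lam
    survives the normalization.
    """
    scores = rewards - lam * costs
    return (scores - scores.mean()) / (scores.std() + eps)

def grpo_advantages_naive(rewards, costs, lam, eps=1e-8):
    """Naive per-component variant (the pathology the paper analyzes).

    Dividing each component by its own group std rescales the cost term by
    std(rewards) / std(costs) relative to the scalarized objective, so the
    effective multiplier is no longer lam and the Lagrangian signal is
    corrupted.
    """
    adv_r = (rewards - rewards.mean()) / (rewards.std() + eps)
    adv_c = (costs - costs.mean()) / (costs.std() + eps)
    return adv_r - lam * adv_c

def dual_ascent_step(lam, violation_rate, limit, lr=0.05):
    """Projected dual ascent on the multiplier: grow lam while the empirical
    violation rate exceeds the allowed limit; keep lam non-negative."""
    return max(0.0, lam + lr * (violation_rate - limit))
```

With indicator costs, the group-mean cost is exactly the empirical violation rate, so the dual-ascent step above directly drives that rate toward the limit, which matches the abstract's claim that violation rates can be optimized directly.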
Related papers
- Constraint-Aware Generative Auto-bidding via Pareto-Prioritized Regret Optimization [8.514099612407062]
PRO-Bid is a constraint-aware generative auto-bidding framework based on two synergistic mechanisms. It achieves superior constraint satisfaction and value acquisition compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-09T04:41:30Z)
- Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning [2.0719232729184145]
Offline Reinforcement Learning (RL) relies on policy constraints to shape performance. Most existing methods commit to a single constraint family. We propose Continuous Constraint Interpolation (CCI), a unified optimization framework.
arXiv Detail & Related papers (2026-01-30T14:21:41Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning [9.62939764063531]
Constrained Reinforcement Learning aims to maximize return while adhering to predefined constraint limits. In continuous control settings, balancing the trade-off between reward and constraint satisfaction remains a significant challenge. We introduce a novel approach that augments the reward structure with an adaptive incentive mechanism to keep the policy within the constraint bound.
arXiv Detail & Related papers (2025-09-11T07:33:35Z)
- Proactive Constrained Policy Optimization with Preemptive Penalty [11.93135424276656]
We propose a novel preemptive penalty mechanism for constrained policy optimization. This mechanism integrates barrier terms into the objective function as the policy nears the constraint boundary, imposing a cost. We also introduce a constraint-aware intrinsic reward that guides boundary-aware exploration and is activated only when the policy approaches the constraint boundary.
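One plausible shape for such a barrier term, shown purely as an illustration under assumed details (the margin band, log form, and weight are guesses, not the paper's construction): the penalty is zero while the estimated cost sits well below the limit and grows without bound as it approaches the boundary.

```python
import math

def preemptive_barrier(cost_estimate, limit, margin=0.1, weight=1.0):
    """Barrier-style penalty that activates only near the boundary:
    zero while limit - cost_estimate >= margin, then growing steeply
    (toward +inf) as the estimated cost approaches the limit."""
    gap = limit - cost_estimate
    if gap >= margin:
        return 0.0                 # far from the boundary: inactive
    gap = max(gap, 1e-6)           # keep the log argument positive
    return -weight * math.log(gap / margin)
```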
arXiv Detail & Related papers (2025-08-03T18:35:55Z)
- Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints [52.37099916582462]
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints.
We propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN).
PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration.
arXiv Detail & Related papers (2024-07-22T10:57:32Z)
- Conflict-Averse Gradient Aggregation for Constrained Multi-Objective Reinforcement Learning [13.245000585002858]
In many real-world applications, a reinforcement learning (RL) agent should consider multiple objectives and adhere to safety guidelines.
We propose a constrained multi-objective gradient aggregation algorithm named Constrained Multi-Objective Gradient Aggregator (CoGAMO).
arXiv Detail & Related papers (2024-03-01T04:57:13Z)
- Resilient Constrained Reinforcement Learning [87.4374430686956]
We study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before training.
It is challenging to identify appropriate constraint specifications because of the undefined trade-off between the reward objective and constraint satisfaction.
We propose a new constrained RL approach that searches for policy and constraint specifications together.
arXiv Detail & Related papers (2023-12-28T18:28:23Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
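The penalty described admits a compact exact-penalty reading; the following sketch is an interpretation of that idea, not the paper's implementation, and the penalty weight kappa and cost-limit handling are assumptions.

```python
def exact_penalty_objective(surrogate_return, expected_cost, cost_limit, kappa=10.0):
    """Exact-penalty objective in the spirit of P3O: the cost constraint
    J_C <= d is folded into the loss as kappa * max(0, J_C - d); for a
    sufficiently large kappa, minimizers of the penalized loss satisfy
    the constraint. The paper pairs this with a clipped surrogate."""
    return -surrogate_return + kappa * max(0.0, expected_cost - cost_limit)
```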
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes the policy for violating the SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
arXiv Detail & Related papers (2021-07-13T21:39:21Z)