Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
- URL: http://arxiv.org/abs/2510.03520v1
- Date: Fri, 03 Oct 2025 21:24:41 GMT
- Title: Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
- Authors: Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh
- Abstract summary: We introduce Certifiable Safe-RLHF, which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art model responses, proving at least five times more effective against nominal and jailbreaking prompts.
- Score: 7.422627253922975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and provides no provable safety guarantee for a fixed dual variable, which can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM responses, proving at least five times more effective against nominal and jailbreaking prompts.
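The rectified penalty-based formulation described in the abstract can be sketched as follows. This is a minimal illustration of the exact-penalty idea, not the paper's implementation: the scalar `reward`, `cost`, `budget`, and `penalty` names are hypothetical, standing in for the reward model score, the cost model's safety score, the safety constraint threshold, and the fixed penalty weight.

```python
def rectified_penalty_loss(reward, cost, budget, penalty):
    """Exact-penalty objective: maximize reward while penalizing
    constraint violation max(0, cost - budget) with a fixed weight.
    Exact penalty theory says that for a sufficiently large fixed
    `penalty`, minimizers of this loss satisfy the constraint,
    so no dual-variable updates are needed."""
    violation = max(0.0, cost - budget)   # rectified constraint term
    return -reward + penalty * violation  # loss to minimize


# A feasible response (cost under budget) pays no penalty; an unsafe
# one is penalized in proportion to how far it exceeds the budget.
safe = rectified_penalty_loss(reward=2.0, cost=0.3, budget=0.5, penalty=10.0)
unsafe = rectified_penalty_loss(reward=2.0, cost=0.9, budget=0.5, penalty=10.0)
```

Because the penalty weight is fixed rather than learned, there is no dual variable for an adversarial prompt to exploit, which is the contrast the abstract draws with Lagrangian-based CMDP training.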
Related papers
- BarrierSteer: LLM Safety via Learning Barrier Steering [83.12893815611052]
BarrierSteer is a novel framework that formalizes safety by embedding learned non-linear safety constraints directly into the model's latent representation space. We show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
arXiv Detail & Related papers (2026-02-23T18:19:46Z) - SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization [0.0]
We introduce SLIME, a reference-free alignment objective designed to decouple preference learning from generation quality. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-02T17:46:06Z) - Boundary-to-Region Supervision for Offline Safe Reinforcement Learning [56.150983204962735]
Boundary-to-Region (B2R) is a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks.
arXiv Detail & Related papers (2025-09-30T03:38:20Z) - Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z) - Safe Deep Reinforcement Learning for Resource Allocation with Peak Age of Information Violation Guarantees [10.177917426690701]
This paper presents a novel optimization theory-based safe deep reinforcement learning (DRL) framework for ultra-reliable Wireless Networked Control Systems (WNCSs). The framework minimizes power consumption under key constraints, including Peak Age of Information (PAoI) violation probability, transmit power, and schedulability in the finite blocklength regime. The proposed framework outperforms rule-based and other optimization theory-based DRL benchmarks, achieving faster convergence, higher rewards, and greater stability.
arXiv Detail & Related papers (2025-07-11T14:57:37Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - From Uncertain to Safe: Conformal Fine-Tuning of Diffusion Models for Safe PDE Control [16.249515106834355]
Deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. We propose Safe Diffusion Models for PDE Control (SafeDiffCon) to achieve optimal control under safety constraints. We evaluate SafeDiffCon on three control tasks: the 1D Burgers' equation, 2D incompressible fluid, and a controlled nuclear fusion problem.
arXiv Detail & Related papers (2025-02-04T10:42:30Z) - One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning [64.33956692265419]
Offline safe RL is of great practical relevance for deploying agents in real-world applications.
We present a novel offline safe RL approach referred to as SaFormer.
arXiv Detail & Related papers (2023-01-28T13:57:01Z) - Safe Wasserstein Constrained Deep Q-Learning [2.088376060651494]
This paper presents a distributionally robust Q-Learning algorithm (DrQ) which leverages Wasserstein ambiguity sets to provide idealistic probabilistic out-of-sample safety guarantees.
Using a case study of lithium-ion battery fast charging, we explore how idealistic safety guarantees translate to generally improved safety.
arXiv Detail & Related papers (2020-02-07T21:23:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.