Related papers: Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling

Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling

URL: http://arxiv.org/abs/2503.10093v2
Date: Sun, 15 Jun 2025 11:39:41 GMT
Title: Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Authors: Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang,
Abstract summary: Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.<n>Current approaches typically address this issue through online sampling from the target policy.<n>We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
Score: 84.00480999255628
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of output generated by policy changes, their overall distribution remains relatively stable. This stability allows the conversion of the sampling process from the target policy into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding about 300x computational overheads.

Related papers

Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models learns that solve complex problems without requiring expert knowledge.<n>Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces.<n>We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts [54.654823811482665]
Large language models (LLMs) increasingly rely on preference alignment methods to steer outputs toward human values. Recent approaches have turned to synthetic data generated by LLMs as a scalable alternative. We propose a novel distribution-aware optimization framework that improves preference alignment in the presence of such shifts.
arXiv Detail & Related papers (2025-04-08T09:14:38Z)
Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs) We present an uncertainty-enhanced textbfPreference textbfOptimization framework to make the LLM self-evolve with reliable feedback. Our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.<n>We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.<n>We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [61.580419063416734]
A recent stream of structured learning approaches has improved the practical state of the art for a range of optimization problems. The key idea is to exploit the statistical distribution over instances instead of dealing with instances separately. In this article, we investigate methods that smooth the risk by perturbing the policy, which eases optimization and improves the generalization error.
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies. Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, textsfPARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning. Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable. Our empirical results substantiate that the proposed textsfPARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
Safe Policy Improvement in Constrained Markov Decision Processes [10.518340300810504]
We present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update. For the former, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification. For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion with high-confidence guarantees.
arXiv Detail & Related papers (2022-10-20T13:29:32Z)
Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner. We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z)
Reinforcement Learning in the Wild: Scalable RL Dispatching Algorithm Deployed in Ridehailing Marketplace [12.298997392937876]
This study proposes a real-time dispatching algorithm based on reinforcement learning. It is deployed online in multiple cities under DiDi's operation for A/B testing and is launched in one of the major international markets. The deployed algorithm shows over 1.3% improvement in total driver income from A/B testing.
arXiv Detail & Related papers (2022-02-10T16:07:17Z)
Mixed Reinforcement Learning with Additive Stochastic Uncertainty [19.229447330293546]
Reinforcement learning (RL) methods often rely on massive exploration data to search optimal policies, and suffer from poor sampling efficiency. This paper presents a mixed RL algorithm by simultaneously using dual representations of environmental dynamics to search the optimal policy. The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of non-affine nonlinear systems.
arXiv Detail & Related papers (2020-02-28T08:02:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.