ROPO: Robust Preference Optimization for Large Language Models
- URL: http://arxiv.org/abs/2404.04102v2
- Date: Tue, 28 May 2024 17:11:53 GMT
- Title: ROPO: Robust Preference Optimization for Large Language Models
- Authors: Xize Liang, Chao Chen, Shuang Qiu, Jie Wang, Yue Wu, Zhihang Fu, Zhihao Shi, Feng Wu, Jieping Ye
- Abstract summary: We propose an iterative alignment approach that integrates noise tolerance and the filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
- Score: 59.10763211091664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts to address this problem either marginally alleviate the impact of noise without the ability to actually reduce its presence, or rely on costly teacher LLMs prone to reward misgeneralization. To address these challenges, we propose the RObust Preference Optimization (ROPO) framework, an iterative alignment approach that integrates noise tolerance and the filtering of noisy samples without the aid of external models. Specifically, ROPO iteratively solves a constrained optimization problem, where we dynamically assign a quality-aware weight to each sample and constrain the sum of the weights to the number of samples we intend to retain. For noise-tolerant training and effective noise identification, we derive a robust loss by suppressing the gradients of samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is critical for distinguishing noisy samples from clean ones. Furthermore, inspired by our derived loss, we propose a robustness-guided rejection sampling technique to compensate for the potentially important information in discarded queries. Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods, with its superiority growing as the noise rate increases.
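To ground the abstract's two key mechanisms, here is a minimal PyTorch sketch: a robust per-pair loss whose gradient is suppressed for high-uncertainty samples, and quality-aware weights whose sum is constrained to the number of samples to retain. The function names, the confidence factor, and the exact functional forms are illustrative assumptions inferred from the abstract alone, not the paper's derivation.

```python
# Hedged sketch of the abstract's mechanisms -- NOT the exact ROPO loss.
import torch
import torch.nn.functional as F

def dpo_margin(pi_chosen_logp, pi_rejected_logp,
               ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO log-odds margin for each preference pair."""
    return beta * ((pi_chosen_logp - ref_chosen_logp)
                   - (pi_rejected_logp - ref_rejected_logp))

def robust_loss(margin):
    """Assumed gradient-suppression scheme: scale the usual DPO loss
    -logsigmoid(margin) by a detached confidence factor |2*sigma(m) - 1|,
    which vanishes when the preference probability is near 0.5, so the
    gradients of high-uncertainty pairs are suppressed."""
    confidence = (2.0 * torch.sigmoid(margin) - 1.0).abs().detach()
    return confidence * (-F.logsigmoid(margin))

def quality_weights(losses, num_keep):
    """Closed-form solution of the constrained problem
    min_w sum_i w_i * l_i  s.t.  sum_i w_i = num_keep, 0 <= w_i <= 1:
    assign weight 1 to the num_keep samples with the smallest loss."""
    w = torch.zeros_like(losses)
    w[torch.topk(losses, k=num_keep, largest=False).indices] = 1.0
    return w

# One training step over a batch of preference pairs:
# per_pair = robust_loss(dpo_margin(...))
# w = quality_weights(per_pair.detach(), num_keep=int(0.8 * per_pair.numel()))
# loss = (w * per_pair).sum() / w.sum()
```

Under this reading, the constrained problem has a closed-form solution (retain the lowest-loss samples), and the robust loss is what makes those per-sample losses informative enough to separate noisy pairs from clean ones.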
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noise problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- Large Language Model Enhanced Hard Sample Identification for Denoising Recommendation [4.297249011611168]
Implicit feedback, which is often used to build recommender systems, is inherently noisy.
Previous studies have attempted to alleviate this by identifying noisy samples based on their divergent patterns.
We propose a Large Language Model Enhanced Hard Sample Denoising framework.
arXiv Detail & Related papers (2024-09-16T14:57:09Z)
- Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization [45.6430987775264]
This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO).
We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations.
We introduce Distributionally Robustifying DPO, which integrates pairwise robustness by optimizing against worst-case pairwise scenarios (see the hedged sketch after this list).
arXiv Detail & Related papers (2024-07-10T17:48:25Z)
- Double Correction Framework for Denoising Recommendation [45.98207284259792]
In implicit feedback, noisy samples can hinder precise user preference learning.
A popular solution is to drop noisy samples during the model training phase.
We propose a Double Correction Framework for Denoising Recommendation.
arXiv Detail & Related papers (2024-05-18T12:15:10Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), each have limitations: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- Improve Noise Tolerance of Robust Loss via Noise-Awareness [60.34670515595074]
We propose a meta-learning method that adaptively learns a hyperparameter prediction function, called the Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster for brevity).
We integrate four SOTA robust loss functions with our algorithm, and comprehensive experiments substantiate the general applicability and effectiveness of the proposed method in terms of both noise tolerance and performance.
arXiv Detail & Related papers (2023-01-18T04:54:58Z)
- Square Root Principal Component Pursuit: Tuning-Free Noisy Robust Matrix Recovery [8.581512812219737]
We propose a new framework for low-rank matrix recovery from observations corrupted with noise and outliers.
Inspired by the square root Lasso, this new formulation does not require prior knowledge of the noise level.
We show that a single, universal choice of the regularization parameter suffices to achieve reconstruction error proportional to the (a priori unknown) noise level.
arXiv Detail & Related papers (2021-06-17T02:28:11Z)
- RDP-GAN: A Rényi-Differential Privacy based Generative Adversarial Network [75.81653258081435]
Generative adversarial networks (GANs) have attracted increasing attention recently owing to their impressive ability to generate realistic samples with high privacy protection.
However, when GANs are applied to sensitive or private training examples, such as medical or financial records, they may still divulge individuals' sensitive and private information.
We propose a Rényi-differentially private GAN (RDP-GAN), which achieves differential privacy (DP) in a GAN by carefully adding random noise to the value of the loss function during training.
arXiv Detail & Related papers (2020-07-04T09:51:02Z)
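The "worst-case pairwise scenarios" idea from the Distributionally Robustifying DPO entry above can be made concrete with a soft aggregation over per-pair losses. The sketch below uses a generic log-sum-exp soft-min with temperature `beta_prime`; treating this as that paper's exact closed form is an assumption.

```python
# Hedged sketch: soft re-weighting of preference-pair losses, one generic
# way to realize pairwise distributional robustness to flipped labels.
import math
import torch

def soft_robust_aggregate(pair_losses: torch.Tensor,
                          beta_prime: float = 1.0) -> torch.Tensor:
    """L = -beta' * log(mean_i exp(-l_i / beta')).
    The induced gradient weights pair i by softmax(-l_i / beta'),
    so pairs with very large loss (often mislabeled) contribute less."""
    n = pair_losses.numel()
    return -beta_prime * (torch.logsumexp(-pair_losses / beta_prime, dim=0)
                          - math.log(n))
```

As `beta_prime` grows this recovers the ordinary mean of the pair losses (standard DPO); as it shrinks, training trusts only the best-fitting pairs.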