Leveraging Robust Optimization for LLM Alignment under Distribution Shifts
- URL: http://arxiv.org/abs/2504.05831v3
- Date: Tue, 20 May 2025 06:42:37 GMT
- Title: Leveraging Robust Optimization for LLM Alignment under Distribution Shifts
- Authors: Mingye Zhu, Yi Liu, Zheren Fu, Yongdong Zhang, Zhendong Mao
- Abstract summary: Preference alignment methods are increasingly critical for steering large language models to generate outputs consistent with human values. We propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts.
- Score: 52.983390470606146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and cost-efficiency reasons, this reliance can introduce distribution shifts that undermine the nuanced representation of human preferences needed for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts. Our approach first leverages well-learned classifiers to assign a calibration value to each training sample, quantifying its alignment with the target human-preferred distribution. These values are then incorporated into a robust optimization objective that minimizes the worst-case loss over regions of the data space most relevant to human preferences. By explicitly focusing optimization on the target distribution, our approach mitigates the impact of distributional mismatch and improves the generation of responses that better reflect intended values.
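As a rough illustration of the two steps described in the abstract (a per-sample calibration value, then a worst-case objective focused on the target distribution), the sketch below uses a KL-regularized worst-case loss whose reference distribution is proportional to the calibration values. This is an assumed stand-in, not the paper's actual objective: the function name `calibrated_robust_loss`, the temperature `tau`, and the choice of the KL dual form are all hypothetical, and the calibration values are taken as given rather than produced by a trained classifier.

```python
import math
from typing import Sequence


def calibrated_robust_loss(
    losses: Sequence[float],
    calibration: Sequence[float],
    tau: float = 1.0,
) -> float:
    """KL-regularized worst-case loss around a calibration-weighted reference.

    Hypothetical sketch: p_i is proportional to calibration[i], and the value
    returned is tau * log(sum_i p_i * exp(losses[i] / tau)), which is the
    closed form of max_q { E_q[loss] - tau * KL(q || p) }.
    """
    if len(losses) != len(calibration) or len(losses) == 0:
        raise ValueError("losses and calibration must be non-empty and equal in length")
    if tau <= 0:
        raise ValueError("tau must be positive")
    if any(c < 0 for c in calibration):
        raise ValueError("calibration values must be non-negative")
    total = sum(calibration)
    if total <= 0:
        raise ValueError("at least one calibration value must be positive")

    # Keep only samples with positive reference mass; scale losses by 1/tau.
    pairs = [(l / tau, c / total) for l, c in zip(losses, calibration) if c > 0]

    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(z for z, _ in pairs)
    s = sum(p * math.exp(z - m) for z, p in pairs)
    return tau * (m + math.log(s))


if __name__ == "__main__":
    per_sample_losses = [0.2, 1.5, 0.7]      # made-up preference losses
    calibration_values = [0.9, 0.1, 0.8]     # made-up classifier scores
    print(calibrated_robust_loss(per_sample_losses, calibration_values, tau=0.5))
```

Under these assumptions, a sample with calibration value 0 drops out of the objective entirely, a small `tau` concentrates the objective on the hardest well-calibrated samples, and a large `tau` brings it close to the calibration-weighted average loss.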
Related papers
- Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences [13.588231827053923]
Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. We propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods.
arXiv Detail & Related papers (2025-06-03T09:47:22Z) - Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Direct Distributional Optimization for Provable Alignment of Diffusion Models [39.048284342436666]
We introduce a novel alignment method for diffusion models from distribution optimization perspectives.
We first formulate the problem as a generic regularized loss minimization over probability distributions.
We enable sampling from the learned distribution by approximating its score function via Doob's $h$-transform technique.
arXiv Detail & Related papers (2025-02-05T07:35:15Z) - Robust LLM Alignment via Distributionally Robust Direct Preference Optimization [15.328510632723505]
A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. We develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
arXiv Detail & Related papers (2025-02-04T02:03:19Z) - RosePO: Aligning LLM-based Recommenders with Human Values [38.029251417802044]
We propose a general framework -- Recommendation with smoothing personalized Preference Optimization (RosePO).
RosePO better aligns with customized human values during the post-training stage.
Evaluation on three real-world datasets demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2024-10-16T12:54:34Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic.
In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z) - Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Negating Negatives: Alignment with Human Negative Samples via Distributional Dispreference Optimization [37.8788435790632]
Large language models (LLMs) have revolutionized the role of AI, yet pose potential social risks.
Existing methods rely on high-quality positive-negative training pairs, suffering from noisy positive responses that are barely distinguishable from negative ones.
We propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between dispreferred responses and the generated non-negative ones.
arXiv Detail & Related papers (2024-03-06T03:02:38Z) - Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game [31.66896160733569]
We propose an Adversarial Preference Optimization (APO) framework to target more efficient human preference optimization.
We find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness.
arXiv Detail & Related papers (2023-11-14T10:10:31Z) - FedLAP-DP: Federated Learning by Sharing Differentially Private Loss Approximations [53.268801169075836]
We propose FedLAP-DP, a novel privacy-preserving approach for federated learning.
A formal privacy analysis demonstrates that FedLAP-DP incurs the same privacy costs as typical gradient-sharing schemes.
Our approach presents a faster convergence speed compared to typical gradient-sharing methods.
arXiv Detail & Related papers (2023-02-02T12:56:46Z) - KL Guided Domain Adaptation [88.19298405363452]
Domain adaptation is an important problem and often needed for real-world applications.
A common approach in the domain adaptation literature is to learn a representation of the input that has the same distributions over the source and the target domain.
We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples.
arXiv Detail & Related papers (2021-06-14T22:24:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.