Robust Preference Alignment via Directional Neighborhood Consensus
- URL: http://arxiv.org/abs/2510.20498v2
- Date: Fri, 24 Oct 2025 02:19:56 GMT
- Title: Robust Preference Alignment via Directional Neighborhood Consensus
- Authors: Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei
- Abstract summary: We introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. Our work presents a practical, theoretically grounded solution for enhancing the reliability of preference-aligned models.
- Score: 13.313830197011983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short on specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically grounded solution for enhancing the reliability of preference-aligned models.
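The selection procedure described in the abstract can be summarized in a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical placeholder for any preference-conditioned decoder (e.g., a DPA-, DPO-, or SFT-aligned model), `score` stands for an alignment scorer such as a reward model evaluated at the user's original preference vector, and the angular-neighborhood sampler is one plausible reading of a "local neighborhood of related preferences".

```python
import numpy as np

def sample_neighborhood(pref, n_neighbors=4, max_angle_deg=15.0, seed=0):
    """Return the original preference vector plus n_neighbors unit vectors
    drawn from a small angular neighborhood around it (assumed sampler)."""
    rng = np.random.default_rng(seed)
    pref = np.asarray(pref, dtype=float)
    pref = pref / np.linalg.norm(pref)
    neighbors = [pref]
    for _ in range(n_neighbors):
        noise = rng.normal(size=pref.shape)
        noise -= noise.dot(pref) * pref      # keep only the component orthogonal to pref
        noise /= np.linalg.norm(noise)
        theta = np.deg2rad(max_angle_deg) * rng.uniform()
        neighbors.append(np.cos(theta) * pref + np.sin(theta) * noise)
    return neighbors

def robust_preference_selection(prompt, pref, generate, score, **kwargs):
    """Generate one candidate per neighboring preference, then pick the
    candidate that scores highest under the user's ORIGINAL preference."""
    candidates = [generate(prompt, p) for p in sample_neighborhood(pref, **kwargs)]
    return max(candidates, key=lambda response: score(prompt, response, pref))
```

The design point the abstract emphasizes is that candidates are generated under slightly perturbed preferences, yet selection is always made against the original preference, which is what widens coverage relative to sampling repeatedly from the single target preference.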
Related papers
- Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences [14.686788596611246]
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. We propose a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
arXiv Detail & Related papers (2025-10-17T15:00:40Z) - Multi-Metric Preference Alignment for Generative Speech Restoration [15.696247605348383]
We propose a multi-metric preference alignment strategy for generative models. We observe consistent and significant performance gains across three diverse generative paradigms. Our aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels.
arXiv Detail & Related papers (2025-08-24T07:05:10Z) - Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals [46.58760908162995]
We propose a novel, theoretically-grounded data selection principle for large language models. We prove the optimality of this strategy by analyzing the loss bounds of the Direct Preference Optimization objective. Our strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle.
arXiv Detail & Related papers (2025-08-11T05:43:02Z) - More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment [80.04449725137177]
Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment. Using solely self-generated responses for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models.
arXiv Detail & Related papers (2025-04-03T00:36:40Z) - Rethinking Diverse Human Preference Learning through Principal Component Analysis [22.123631189289963]
We introduce Decomposed Reward Models (DRMs) for extracting diverse human preferences from binary comparisons. DRMs represent preferences as vectors and analyze them using Principal Component Analysis (PCA). DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.
arXiv Detail & Related papers (2025-02-18T18:55:26Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences [14.686788596611246]
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. We propose a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
arXiv Detail & Related papers (2024-05-23T21:25:20Z) - Aligning Crowd Feedback via Distributional Preference Reward Modeling [28.754532173765686]
We propose the Distributional Preference Reward Model (DPRM) to align large language models with diverse human preferences.
Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
arXiv Detail & Related papers (2024-02-15T07:29:43Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)