What Does Preference Learning Recover from Pairwise Comparison Data?
- URL: http://arxiv.org/abs/2602.10286v1
- Date: Tue, 10 Feb 2026 20:59:55 GMT
- Title: What Does Preference Learning Recover from Pairwise Comparison Data?
- Authors: Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar,
- Abstract summary: Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences.<n>A typical dataset consists of triplets $(x, y+, y-)$, where response $y+$ is preferred over response $y-$ for context.<n>We formalize the preference information it encodes through the conditional preference distribution (CPRD)<n>These results offer a data-centric foundation for understanding what preference learning actually recovers.
- Score: 39.21477667267936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred over response $y^-$ for context $x$. The Bradley--Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency -- namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.
Related papers
- Data Selection for ERMs [67.57726352698933]
We study how well can $mathcalA$ perform when trained on at most $nll N$ data points selected from a population of $N$ points.<n>Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression.
arXiv Detail & Related papers (2025-04-20T11:26:01Z) - DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.<n>Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.<n>Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - Best Policy Learning from Trajectory Preference Feedback [11.896067099790962]
Preference-based Reinforcement Learning (PbRL) offers a more robust alternative.<n>We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models.<n>We propose Posterior Sampling for Preference Learning ($mathsfPSPL$), a novel algorithm inspired by Top-Two Thompson Sampling.
arXiv Detail & Related papers (2025-01-31T03:55:10Z) - Reward Modeling with Ordinal Feedback: Wisdom of the Crowd [9.034189257088762]
Learning a reward model (RM) from human preferences has been an important component in aligning large language models.
We propose a framework for learning RMs under ordinal feedback.
We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity.
arXiv Detail & Related papers (2024-11-19T20:17:04Z) - Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently.<n>We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.<n>Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R3M$, which models the potentially corrupted preference label as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z) - Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs.
We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z) - CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models [4.9125631769031415]
This paper addresses the challenges of aligning large language models with human values via preference learning.<n>We propose a novel method for robustly and maliciously manipulated AI pipeline datasets to enhance LLMs' resilience.
arXiv Detail & Related papers (2024-03-05T07:58:12Z) - From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real.
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z) - Model-based Offline Imitation Learning with Non-expert Data [7.615595533111191]
We propose a scalable model-based offline imitation learning algorithmic framework that leverages datasets collected by both suboptimal and optimal policies.
We show that the proposed method textitalways outperforms Behavioral Cloning in the low data regime on simulated continuous control domains.
arXiv Detail & Related papers (2022-06-11T13:08:08Z) - Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z) - Towards Robustifying NLI Models Against Lexical Dataset Biases [94.79704960296108]
This paper explores both data-level and model-level debiasing methods to robustify models against lexical dataset biases.
First, we debias the dataset through data augmentation and enhancement, but show that the model bias cannot be fully removed via this method.
The second approach employs a bag-of-words sub-model to capture the features that are likely to exploit the bias and prevents the original model from learning these biased features.
arXiv Detail & Related papers (2020-05-10T17:56:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.