Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
- URL: http://arxiv.org/abs/2506.05619v2
- Date: Sun, 05 Oct 2025 11:24:21 GMT
- Title: Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
- Authors: Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo
- Abstract summary: We develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. We also propose a soft-max relaxation method that smoothly trades off population-proportional alignment with the selection of the Condorcet winner.
- Score: 7.065259679465175
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and susceptible to strategic manipulation. To address this issue, we develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Grounded in social choice theory, our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly introduced axioms of population-proportional alignment and population-bounded manipulability. Moreover, we propose a soft-max relaxation method that smoothly trades off population-proportional alignment with the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large language model alignment.
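For intuition, here is a minimal toy sketch of how a soft-max trade-off between population-proportional mixing and Condorcet-winner selection could look; the function name, its inputs (`support`, `win_rates`), and the temperature `tau` are illustrative assumptions and do not reproduce the paper's actual construction.

```python
import numpy as np

def relaxed_policy(support, win_rates, tau=1.0):
    """Toy interpolation, illustrative only (not the paper's method).

    support[i]    : assumed estimate of the fraction of evaluators whose top
                    choice is option i (the population-proportional target).
    win_rates[i,j]: empirical probability that option i beats option j.
    tau -> 0      : concentrate on the option with the best Copeland score,
                    i.e. the Condorcet winner whenever one exists.
    tau -> inf    : return the population-proportional mixture `support`."""
    copeland = (win_rates > 0.5).sum(axis=1).astype(float)  # pairwise wins per option
    logits = copeland / max(tau, 1e-8)
    sharp = np.exp(logits - logits.max())
    sharp /= sharp.sum()                                    # near one-hot for small tau
    lam = tau / (1.0 + tau)                                 # smooth mixing weight in [0, 1)
    return lam * support + (1.0 - lam) * sharp

# Three options where option 0 is the Condorcet winner but only 45% of
# evaluators rank it first.
support = np.array([0.45, 0.35, 0.20])
win_rates = np.array([[0.50, 0.60, 0.55],
                      [0.40, 0.50, 0.70],
                      [0.45, 0.30, 0.50]])
print(relaxed_policy(support, win_rates, tau=0.01))  # nearly one-hot on option 0
print(relaxed_policy(support, win_rates, tau=10.0))  # close to `support`
```

With a small temperature the toy policy concentrates on the Condorcet winner; with a large one it returns the proportional mixture, which is the qualitative trade-off the abstract describes.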
Related papers
- How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics [65.67654005892469]
We show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies. Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods.
arXiv Detail & Related papers (2026-02-12T17:11:08Z)
- From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models [0.7366405857677227]
This survey provides a theoretical unification of preference learning methods. We formalize each axis with precise definitions and theorems. We synthesize empirical findings across 50+ papers and provide a practitioner's decision guide for method selection.
arXiv Detail & Related papers (2026-01-03T08:33:26Z)
- Pluralistic Off-policy Evaluation and Alignment [47.35585359400588]
We propose POPE, the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines a collaborative utility component derived from human preference signals and a diversity component inspired by entropy-based coverage measures. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models' general capabilities on downstream tasks.
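A minimal sketch of a combined reward of this kind, assuming a scalar preference-derived utility and an entropy bonus over a candidate-response distribution; the function name, inputs, and weighting `lam` are hypothetical and not POPE's actual interface.

```python
import numpy as np

def combined_reward(utility, response_probs, lam=0.5):
    """Illustrative only: mix a preference-derived utility with an
    entropy-style diversity bonus, loosely following the POPE summary above."""
    probs = np.clip(response_probs, 1e-12, 1.0)
    diversity = -np.sum(probs * np.log(probs))   # entropy-based coverage term
    return utility + lam * diversity

# Toy usage: a response scored 0.8 by the preference model, drawn from a
# fairly peaked distribution over four candidate response clusters.
print(combined_reward(0.8, np.array([0.7, 0.1, 0.1, 0.1]), lam=0.2))
```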
arXiv Detail & Related papers (2025-09-15T01:57:49Z)
- PB$^2$: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning [2.0373030742807545]
We identify and address this preference exploration problem through population-based methods. We demonstrate that maintaining a diverse population of agents enables more comprehensive exploration of the preference landscape. This diversity improves reward model learning by generating preference queries with clearly distinguishable behaviors.
arXiv Detail & Related papers (2025-06-16T17:51:33Z)
- Alternates, Assemble! Selecting Optimal Alternates for Citizens' Assemblies [1.5624421399300306]
Deliberative democracy centers on citizens' assemblies, where randomly selected people discuss policy questions. We introduce an optimization framework for alternate selection. Our approach estimates dropout probabilities using historical data and selects alternates to minimize expected misrepresentation. Empirical evaluation using real-world data demonstrates that, compared to the status quo, our method significantly improves representation while requiring fewer alternates.
arXiv Detail & Related papers (2025-06-02T17:48:33Z)
- No Preference Left Behind: Group Distributional Preference Optimization [46.98320272443297]
Group Distributional Preference Optimization (GDPO) is a novel framework that aligns language models with the distribution of preferences within a group. GDPO calibrates a language model using statistical estimation of the group's belief distribution. GDPO consistently reduces the alignment gap during training.
arXiv Detail & Related papers (2024-12-28T23:30:47Z)
- VPO: Leveraging the Number of Votes in Preference Optimization [5.200545764106177]
We introduce a technique that leverages user voting data to better align with diverse subjective preferences.
We develop the Vote-based Preference Optimization framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs.
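One way such vote counts could enter a preference objective, sketched here with a DPO-style preference logit and a soft vote-share target; the function and its arguments are assumptions for illustration, not necessarily VPO's exact loss.

```python
import torch
import torch.nn.functional as F

def vote_weighted_preference_loss(margin, votes_chosen, votes_rejected):
    """Illustrative sketch: replace the hard 'chosen beats rejected' label with
    the vote split, so controversial pairs (55 vs 45) pull the model less than
    obvious ones (90 vs 10).

    margin: the model's preference logit for chosen over rejected, e.g. a
    scaled log-probability difference as in DPO-style objectives."""
    target = votes_chosen / (votes_chosen + votes_rejected)  # soft label in (0, 1)
    return F.binary_cross_entropy_with_logits(margin, target)

# Toy usage: one obvious pair (90 vs 10 votes) and one controversial pair (55 vs 45).
margin = torch.tensor([2.0, 2.0])
print(vote_weighted_preference_loss(margin,
                                    torch.tensor([90.0, 55.0]),
                                    torch.tensor([10.0, 45.0])))
```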
arXiv Detail & Related papers (2024-10-30T10:39:34Z)
- ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
- Pareto-Optimal Learning from Preferences with Hidden Context [17.590330740964266]
We propose POPL, which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness.
arXiv Detail & Related papers (2024-06-21T18:57:38Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
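As background, a generic baseline-corrected inverse-propensity-scoring (IPS) estimator with an empirically estimated control-variate baseline looks roughly like the sketch below; this is a textbook construction under the assumption E[w] = 1 and is not claimed to be the paper's exact closed-form solution.

```python
import numpy as np

def baseline_corrected_ips(rewards, target_probs, logging_probs):
    """Sketch: V_hat = mean(w * (r - beta)) + beta with w = pi_target / pi_log.
    Because E[w] = 1 under the logging policy, the estimate stays unbiased for
    any beta; a plug-in variance-minimizing choice is beta = Cov(w*r, w) / Var(w)."""
    w = target_probs / logging_probs
    wr = w * rewards
    beta = np.cov(wr, w, ddof=0)[0, 1] / max(np.var(w), 1e-12)
    return np.mean(w * (rewards - beta)) + beta

# Toy usage with synthetic data logged by a uniform policy over two actions.
rng = np.random.default_rng(0)
n = 10_000
actions = rng.random(n) < 0.5
rewards = np.where(actions, 1.0, 0.2) + 0.1 * rng.standard_normal(n)
logging_probs = np.full(n, 0.5)
target_probs = np.where(actions, 0.8, 0.2)       # target policy favors the good action
print(baseline_corrected_ips(rewards, target_probs, logging_probs))  # roughly 0.84
```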
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference [50.95521705711802]
Previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model.
This paper formally formulates the neighborhood effect as an interference problem from the perspective of causal inference.
We propose a novel ideal loss that can be used to deal with selection bias in the presence of neighborhood effect.
arXiv Detail & Related papers (2024-04-30T15:20:41Z)
- MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
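A minimal expectation-maximization sketch for a mixture of preference "types", using a toy per-prompt Bernoulli model in place of the reward-model mixture the paper actually fits; all names and shapes here are assumptions for illustration.

```python
import numpy as np

def em_preference_mixture(prefs, k=2, iters=100, seed=0):
    """prefs[i, j] = 1 if annotator i preferred response A over B on prompt j.
    Returns mixture weights and per-group preference probabilities."""
    rng = np.random.default_rng(seed)
    n, m = prefs.shape
    pi = np.full(k, 1.0 / k)                     # mixture weights over groups
    theta = rng.uniform(0.3, 0.7, size=(k, m))   # per-group P(prefer A) per prompt
    for _ in range(iters):
        # E-step: responsibilities proportional to pi[g] times the likelihood
        # of annotator i's choices under group g.
        log_lik = prefs @ np.log(theta).T + (1 - prefs) @ np.log(1 - theta).T
        log_r = np.log(pi) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and (smoothed) preference rates.
        pi = r.mean(axis=0)
        theta = (r.T @ prefs + 1.0) / (r.sum(axis=0)[:, None] + 2.0)
    return pi, theta

# Toy data: two annotator groups with opposite preferences on five prompts.
rng = np.random.default_rng(1)
group_a = (rng.random((30, 5)) < 0.9).astype(float)
group_b = (rng.random((30, 5)) < 0.1).astype(float)
pi, theta = em_preference_mixture(np.vstack([group_a, group_b]))
print(pi.round(2), theta.round(2))
```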
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)