Related papers: Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

URL: http://arxiv.org/abs/2411.12843v1
Date: Tue, 19 Nov 2024 20:17:04 GMT
Title: Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
Authors: Shang Liu, Yu Pan, Guanting Chen, Xiaocheng Li
Abstract summary: Learning a reward model (RM) from human preferences has been an important component in aligning large language models. We propose a framework for learning RMs under ordinal feedback. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity.
Score: 9.034189257088762
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model that accepts binary feedback, i.e., the label being either Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as "tied" between the two responses) and loses more fine-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback which generalizes the case of binary preference feedback to any arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition validates itself via the sociological concept of the wisdom of the crowd. Under the condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity compared to the case of binary feedback. The proposed learning objective and the theory also extend to hinge loss and direct policy optimization (DPO). In particular, the theoretical analysis may be of independent interest when applying to a seemingly unrelated problem of knowledge distillation to interpret the bias-variance trade-off therein. The framework also sheds light on writing guidance for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning for both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preference boosts RM learning.
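The step from binary to ordinal feedback described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes the ordinal label is encoded as a number in [0, 1] whose expectation equals the Bradley-Terry probability that Response 1 is preferred (one reading of the marginal unbiasedness condition), and it uses the ordinary soft-label cross-entropy loss. The five-level scale and the names `ORDINAL_LABELS` and `ordinal_bt_loss` are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical 5-level ordinal scale mapped to soft labels in [0, 1].
# Binary feedback is the special case that uses only 0.0 and 1.0.
ORDINAL_LABELS = {
    "much_worse": 0.0,
    "slightly_worse": 0.25,
    "tie": 0.5,
    "slightly_better": 0.75,
    "much_better": 1.0,
}


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordinal_bt_loss(reward_1: float, reward_2: float, label: float) -> float:
    """Cross-entropy between a soft label and the Bradley-Terry probability.

    The BT probability that Response 1 beats Response 2 is
    sigmoid(reward_1 - reward_2). With label z, the loss is
    -z * log(p) - (1 - z) * log(1 - p), written here via softplus.
    """
    diff = reward_1 - reward_2
    return label * softplus(-diff) + (1.0 - label) * softplus(diff)


if __name__ == "__main__":
    # Assumed reward-model scores for two responses to the same prompt.
    r1, r2 = 1.2, 0.4
    for name, z in ORDINAL_LABELS.items():
        print(f"{name:16s} z={z:.2f} loss={ordinal_bt_loss(r1, r2, z):.4f}")
```

Under this encoding a "tied" sample is not discarded: with label 0.5 the loss is smallest when the two rewards are equal, so ties pull the reward gap toward zero, while "slightly better" pulls it toward a smaller positive gap than "much better" does.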

Related papers

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance [46.71732887299883]
Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models with human values. We introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations.
arXiv Detail & Related papers (2025-12-29T13:39:41Z)
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning [49.43901716932925]
We show that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback. We find the most support for the explanation that on problems with a generation-verification gap, it is relatively easy to learn the relatively simple RM from the preference data.
arXiv Detail & Related papers (2025-03-03T00:15:19Z)
Best Policy Learning from Trajectory Preference Feedback [15.799929216215672]
We address the problem of best policy identification in preference-based reinforcement learning (PbRL). We propose Posterior Sampling for Preference Learning (PSPL), a novel algorithm inspired by Top-Two Thompson Sampling. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret.
arXiv Detail & Related papers (2025-01-31T03:55:10Z)
A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts. With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS). Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach [32.55211271796683]
We show that commonly adopted difference-in-means estimators can lead to severely biased estimates due to recommender interference. We propose a "recommender choice model" that explicitly represents the interference pathway. We validate our method with a large-scale field experiment on the Weixin short-video platform.
arXiv Detail & Related papers (2024-06-20T14:53:26Z)
Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback. Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference. Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
Inverse Constitutional AI: Compressing Preferences into Principles [37.28372419588119]
We introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. We propose a corresponding ICAI algorithm and validate its generated constitutions on several datasets.
arXiv Detail & Related papers (2024-06-02T11:54:50Z)
RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values. In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback. We propose two frameworks to address heterogeneous human feedback in principled ways: a personalization-based one and an aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z)
Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset. We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation [59.500347564280204]
We propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework. AUR consists of a new uncertainty estimator along with a normal recommender model. As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty.
arXiv Detail & Related papers (2022-09-22T04:32:51Z)
Bilateral Self-unbiased Learning from Biased Implicit Feedback [10.690479112143658]
We propose a novel unbiased recommender learning model, namely BIlateral SElf-unbiased Recommender (BISER). BISER consists of two key components: (i) self-inverse propensity weighting (SIPW) to gradually mitigate the bias of items without incurring high computational costs; and (ii) bilateral unbiased learning (BU) to bridge the gap between two complementary models in model predictions. Extensive experiments show that BISER consistently outperforms state-of-the-art unbiased recommender models over several datasets.
arXiv Detail & Related papers (2022-07-26T05:17:42Z)
Multiple Robust Learning for Recommendation [13.06593469196849]
In recommender systems, a common problem is the presence of various biases in the collected data. We propose a multiple robust (MR) estimator that can take advantage of multiple candidate imputation and propensity models to achieve unbiasedness.
arXiv Detail & Related papers (2022-07-09T13:15:56Z)
Cross Pairwise Ranking for Unbiased Item Recommendation [57.71258289870123]
We develop a new learning paradigm named Cross Pairwise Ranking (CPR). CPR achieves unbiased recommendation without knowing the exposure mechanism. We prove in theory that this way offsets the influence of user/item propensity on the learning.
arXiv Detail & Related papers (2022-04-26T09:20:27Z)
Debiased Explainable Pairwise Ranking from Implicit Feedback [0.3867363075280543]
We focus on the state-of-the-art pairwise ranking model, Bayesian Personalized Ranking (BPR). BPR is a black box model that does not explain its outputs, thus limiting the user's trust in the recommendations. We propose a novel explainable loss function and a corresponding Matrix Factorization-based model that generates recommendations along with item-based explanations.
arXiv Detail & Related papers (2021-07-30T17:19:37Z)
Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features. This simplicity bias can explain their lack of robustness out of distribution (OOD). We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.