Axiomatic Preference Modeling for Longform Question Answering
- URL: http://arxiv.org/abs/2312.02206v1
- Date: Sat, 2 Dec 2023 23:11:41 GMT
- Title: Axiomatic Preference Modeling for Longform Question Answering
- Authors: Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul
Bennett
- Abstract summary: We develop an axiomatic framework to generate a rich variety of preference signals to uphold human preferences.
We use these axiomatic signals to train a model for scoring answers to longform questions.
Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4.
- Score: 15.675861802061078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The remarkable abilities of large language models (LLMs) like GPT-4 partially
stem from post-training processes like Reinforcement Learning from Human
Feedback (RLHF) involving human preferences encoded in a reward model. However,
these reward models (RMs) often lack direct knowledge of why, or under what
principles, the preferences annotations were made. In this study, we identify
principles that guide RMs to better align with human preferences, and then
develop an axiomatic framework to generate a rich variety of preference signals
to uphold them. We use these axiomatic signals to train a model for scoring
answers to longform questions. Our approach yields a Preference Model with only
about 220M parameters that agrees with gold human-annotated preference labels
more often than GPT-4. The contributions of this work include: training a
standalone preference model that can score human- and LLM-generated answers on
the same scale; developing an axiomatic framework for generating training data
pairs tailored to certain principles; and showing that a small amount of
axiomatic signals can help small models outperform GPT-4 in preference scoring.
We release our model on huggingface:
https://huggingface.co/corbyrosset/axiomatic_preference_model
Related papers
- General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Aligning Language Models Using Follow-up Likelihood as Reward Signal [40.388526412214276]
We propose using the likelihood of follow-up utterances as rewards to differentiate preferred responses from less favored ones.
Our proposed reward mechanism, Follow-up Likelihood as Reward" (FLR), matches the performance of strong reward models trained on large-scale human or GPT-4 annotated data.
arXiv Detail & Related papers (2024-09-20T23:47:25Z) - Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z) - Multi-objective Reinforcement learning from AI Feedback [0.0]
This paper presents a novel approach to improve the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF)
In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into simpler principles, such as toxicity, factuality, and sycophancy.
Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones.
arXiv Detail & Related papers (2024-06-11T14:24:00Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - RRHF: Rank Responses to Align Language Models with Human Feedback
without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO)
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.