Peering Through Preferences: Unraveling Feedback Acquisition for
Aligning Large Language Models
- URL: http://arxiv.org/abs/2308.15812v3
- Date: Mon, 5 Feb 2024 19:59:46 GMT
- Title: Peering Through Preferences: Unraveling Feedback Acquisition for
Aligning Large Language Models
- Authors: Hritik Bansal, John Dang, Aditya Grover
- Abstract summary: We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that the preferences inferred from ratings and rankings disagree 60% of the time for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
- Score: 32.843361525236965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models (LLMs) with human values and intents
critically involves the use of human or AI feedback. While dense feedback
annotations are expensive to acquire and integrate, sparse feedback presents a
structural design choice between ratings (e.g., score Response A on a scale of
1-7) and rankings (e.g., is Response A better than Response B?). In this work,
we analyze the effect of this design choice for the alignment and evaluation of
LLMs. We uncover an inconsistency problem wherein the preferences inferred from
ratings and rankings significantly disagree 60% of the time for both human and
AI annotators. Our subsequent analysis identifies various facets of annotator
bias that explain this phenomenon; for example, human annotators rate denser
responses higher when scoring them individually, while preferring accuracy
during pairwise judgments. To our surprise, we also observe that the choice of
feedback protocol has a
significant effect on the evaluation of aligned LLMs. In particular, we find
that LLMs that leverage rankings data for alignment (say model X) are preferred
over those that leverage ratings data (say model Y), with a rank-based
evaluation protocol (is X/Y's response better than the reference response?) but
not with a rating-based evaluation protocol (score X/Y's response on a scale
of 1-7). Our findings thus shed light on critical gaps in methods for
evaluating the real-world utility of language models and their strong
dependence on the feedback protocol used for alignment. Our code and data are
available at https://github.com/Hritikbansal/sparse_feedback.
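For intuition, here is a minimal sketch of the kind of consistency check described above, comparing the pairwise preferences induced by 1-7 ratings against direct rankings; the annotation format and toy data are hypothetical, and the authors' actual code and data live in the repository linked above.

```python
# Minimal sketch of the ratings-vs-rankings consistency check (hypothetical
# annotation format; see https://github.com/Hritikbansal/sparse_feedback for
# the authors' actual data schema and analysis code).

# Each item holds independent 1-7 ratings for two responses plus a direct
# pairwise ranking ("A", "B", or "tie") for the same pair.
annotations = [
    {"rating_a": 6, "rating_b": 4, "ranking": "A"},
    {"rating_a": 3, "rating_b": 3, "ranking": "B"},
    {"rating_a": 5, "rating_b": 7, "ranking": "B"},
]

def preference_from_ratings(rating_a: int, rating_b: int) -> str:
    """Infer a pairwise preference from two independently assigned ratings."""
    if rating_a > rating_b:
        return "A"
    if rating_b > rating_a:
        return "B"
    return "tie"

disagreements = sum(
    preference_from_ratings(x["rating_a"], x["rating_b"]) != x["ranking"]
    for x in annotations
)
print(f"disagreement rate: {disagreements / len(annotations):.0%}")
```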
Related papers
- Reward Modeling with Ordinal Feedback: Wisdom of the Crowd [9.034189257088762]
Learning a reward model (RM) from human preferences has been an important component in aligning large language models.
We propose a framework for learning RMs under ordinal feedback.
We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity.
arXiv: 2024-11-19
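As a rough illustration of learning a reward model under ordinal feedback, the hedged sketch below generalizes the standard Bradley-Terry pairwise loss to soft targets derived from ordinal labels; the label-to-target mapping is an assumption for illustration, not necessarily this paper's formulation.

```python
# Hedged sketch: a Bradley-Terry-style reward-model loss with soft targets
# derived from ordinal feedback. The label-to-probability mapping below is an
# assumed scheme, not necessarily the paper's exact construction.
import torch
import torch.nn.functional as F

# Assumed mapping from an ordinal judgment to a soft target P(A preferred over B).
ORDINAL_TO_TARGET = {"A>>B": 1.0, "A>B": 0.75, "A=B": 0.5, "B>A": 0.25, "B>>A": 0.0}

def ordinal_rm_loss(reward_a: torch.Tensor, reward_b: torch.Tensor, labels: list[str]) -> torch.Tensor:
    """Binary cross-entropy between sigmoid(r_A - r_B) and the soft ordinal target."""
    targets = torch.tensor([ORDINAL_TO_TARGET[label] for label in labels])
    return F.binary_cross_entropy_with_logits(reward_a - reward_b, targets)

# Toy usage with scalar rewards produced by some reward model.
loss = ordinal_rm_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]), ["A>B", "A=B"])
print(loss.item())
```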
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
arXiv: 2024-06-17
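The three-step protocol summarized above (answer the instruction, criticize each candidate against one's own answer, then judge) can be sketched as a simple prompting pipeline; the `generate` callable and the prompt wording below are placeholders, not the paper's actual prompts.

```python
# Hedged sketch of a self-reference feedback loop: the feedback model first
# answers the instruction itself, then criticizes each candidate answer using
# its own answer as a reference, and finally picks a winner. `generate` stands
# in for any text-generation call (e.g., a 13B Llama2-Chat); prompts are illustrative.

def self_reference_judgment(generate, instruction: str, answer_a: str, answer_b: str) -> str:
    own_answer = generate(f"Answer the instruction:\n{instruction}")
    critiques = [
        generate(
            f"Instruction: {instruction}\n"
            f"Reference answer (your own): {own_answer}\n"
            f"Candidate answer: {candidate}\n"
            "Criticize the candidate answer relative to the reference."
        )
        for candidate in (answer_a, answer_b)
    ]
    verdict = generate(
        f"Instruction: {instruction}\n"
        f"Critique of answer A: {critiques[0]}\n"
        f"Critique of answer B: {critiques[1]}\n"
        "Based on these critiques, which answer better fits human preferences? Reply 'A' or 'B'."
    )
    return verdict.strip()
```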
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv: 2024-03-31
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show a clear dislike when models admit their limits.
In contrast, advanced LLMs like GPT-4-Turbo place more emphasis on correctness, clarity, and harmlessness.
We show that preference-based evaluation can be intentionally manipulated.
arXiv: 2024-02-17
- RLVF: Learning from Verbal Feedback without Overgeneralization [94.19501420241188]
We study the problem of incorporating verbal feedback without overgeneralizing it to contexts where it does not apply.
We develop a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO).
Our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts.
arXiv: 2024-02-16
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences over all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv: 2023-07-06
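As a simplified illustration of peer ranking, the sketch below aggregates pairwise judgments from several reviewer LLMs into per-model win rates; the full PR algorithm additionally weights reviewers, and the reviewer and model names here are made up.

```python
# Simplified peer-rank aggregation: count pairwise wins across all reviewer
# LLMs and rank models by win rate. (The actual PR algorithm also weights
# reviewers; the judgments below are fabricated for illustration.)
from collections import defaultdict

# (reviewer, model_a, model_b, winner) tuples -- hypothetical judgments.
judgments = [
    ("reviewer_1", "model_x", "model_y", "model_x"),
    ("reviewer_2", "model_x", "model_y", "model_x"),
    ("reviewer_1", "model_y", "model_z", "model_y"),
    ("reviewer_2", "model_y", "model_z", "model_z"),
]

wins, games = defaultdict(int), defaultdict(int)
for _, model_a, model_b, winner in judgments:
    games[model_a] += 1
    games[model_b] += 1
    wins[winner] += 1

win_rates = {m: wins[m] / games[m] for m in games}
for model, rate in sorted(win_rates.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rate:.2f}")
```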
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv: 2023-05-24
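A minimal sketch of the rescaling step, feeding a Likert rating and its explanation to an LLM together with a scoring rubric; the rubric text, prompt wording, and `call_llm` helper are placeholders rather than the paper's exact setup.

```python
# Hedged sketch: rescale a Likert rating plus its free-text explanation into a
# rubric-anchored 0-100 score using an LLM. The rubric, prompt wording, and the
# `call_llm` callable are placeholders, not the paper's actual setup.

RUBRIC = "100: fully meets the criterion; 50: partially meets it; 0: does not meet it."

def build_rescale_prompt(likert_rating: int, explanation: str) -> str:
    return (
        "An annotator gave the following judgment.\n"
        f"Likert rating (1-5): {likert_rating}\n"
        f"Explanation: {explanation}\n\n"
        f"Scoring rubric:\n{RUBRIC}\n"
        "Return a single integer score from 0 to 100 anchored in this rubric."
    )

def rescale(likert_rating: int, explanation: str, call_llm) -> int:
    """`call_llm` is any callable that sends a prompt to an LLM and returns its text."""
    return int(call_llm(build_rescale_prompt(likert_rating, explanation)).strip())
```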
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We train DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
arXiv: 2020-09-15
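For context, here is a minimal sketch of scoring candidate replies with a released DialogRPT checkpoint, assuming the `microsoft/DialogRPT-updown` checkpoint on the Hugging Face Hub and the `<|endoftext|>` context/response separator described on its public model card.

```python
# Hedged sketch: rank candidate replies with a released DialogRPT checkpoint.
# Assumes the "microsoft/DialogRPT-updown" Hugging Face checkpoint and the
# <|endoftext|> context/response separator from its public model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/DialogRPT-updown"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def score(context: str, response: str) -> float:
    """Higher score = reply predicted to receive more positive (upvote) feedback."""
    inputs = tokenizer.encode(context + "<|endoftext|>" + response, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs).logits
    return torch.sigmoid(logits).item()

context = "I love NLP!"
for reply in ["Me too!", "This is completely unrelated."]:
    print(reply, score(context, reply))
```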
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.