Dissecting Human and LLM Preferences
- URL: http://arxiv.org/abs/2402.11296v1
- Date: Sat, 17 Feb 2024 14:34:31 GMT
- Title: Dissecting Human and LLM Preferences
- Authors: Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
- Abstract summary: We find that humans are less sensitive to errors, favor responses that support their stances, and show a clear dislike when models admit their limits.
In contrast, advanced LLMs like GPT-4-Turbo place more emphasis on correctness, clarity, and harmlessness.
We show that preference-based evaluation can be intentionally manipulated.
- Score: 80.55271307662365
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As a relative quality comparison of model responses, human and Large Language
Model (LLM) preferences serve as common alignment goals in model fine-tuning
and criteria in evaluation. Yet, these preferences merely reflect broad
tendencies, resulting in less explainable and controllable models with
potential safety risks. In this work, we dissect the preferences of human and
32 different LLMs to understand their quantitative composition, using
annotations from real-world user-model conversations for a fine-grained,
scenario-wise analysis. We find that humans are less sensitive to errors, favor
responses that support their stances, and show clear dislike when models admit
their limits. On the contrary, advanced LLMs like GPT-4-Turbo emphasize
correctness, clarity, and harmlessness more. Additionally, LLMs of similar
sizes tend to exhibit similar preferences, regardless of their training
methods, and fine-tuning for alignment does not significantly alter the
preferences of pretrained-only LLMs. Finally, we show that preference-based
evaluation can be intentionally manipulated. In both training-free and
training-based settings, aligning a model with the preferences of judges boosts
scores, while injecting the least preferred properties lowers them. This
results in notable score shifts: up to 0.59 on MT-Bench (1-10 scale) and 31.94
on AlpacaEval 2.0 (0-100 scale), highlighting the significant impact of this
strategic adaptation.
Interactive Demo: https://huggingface.co/spaces/GAIR/Preference-Dissection-Visualization
Dataset: https://huggingface.co/datasets/GAIR/preference-dissection
Code: https://github.com/GAIR-NLP/Preference-Dissection
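The scenario-wise dissection described above turns on estimating how much each annotated property (correctness, clarity, admitting limits, and so on) contributes to a judge's pairwise choice. The minimal sketch below estimates such per-property weights with a logistic regression over property-score differences; the property names, data layout, and synthetic data are hypothetical and not necessarily the authors' exact fitting procedure.

```python
# Minimal sketch: estimate per-property preference weights for a judge via
# logistic regression over property-score differences between two responses.
# Property names, data layout, and data are hypothetical (synthetic), not the
# paper's exact procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

PROPERTIES = ["correctness", "clarity", "harmlessness",
              "admits_limits", "supports_user_stance"]  # hypothetical labels

rng = np.random.default_rng(0)
n_pairs = 500

# X[i, j] = property-j score of response A minus that of response B, in [-1, 1]
X = rng.uniform(-1, 1, size=(n_pairs, len(PROPERTIES)))
# y[i] = 1 if this judge preferred response A; synthetic judge with fixed weights
true_w = np.array([2.0, 1.0, 1.5, -0.5, 0.8])
y = (X @ true_w + rng.normal(0, 0.5, n_pairs) > 0).astype(int)

judge_model = LogisticRegression().fit(X, y)
for name, weight in zip(PROPERTIES, judge_model.coef_[0]):
    # A larger positive weight means the property matters more to this judge.
    print(f"{name:>22s}: {weight:+.2f}")
```

Comparing the fitted weights across judges (human annotators, GPT-4-Turbo, models of different sizes) is one way to surface divergences of the kind the abstract reports.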
Related papers
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
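A rough sketch of the routing idea, under loose assumptions: a regressor maps a candidate human/LM annotation mixture to predicted reward-model performance, and the best-scoring mixture is selected. The features, model choice, and synthetic data below are hypothetical, not the paper's formulation.

```python
# Rough sketch of the routing idea: a regressor predicts reward-model (RM)
# performance for a candidate human/LM annotation mixture, and the
# best-scoring mixture is chosen. Features, model, and data are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Hypothetical mixture features: fraction of human annotations plus two
# aggregate dataset statistics.
def featurize(human_frac, avg_prompt_len, domain_diversity):
    return [human_frac, avg_prompt_len, domain_diversity]

# Synthetic training data: (mixture features, measured RM accuracy on held-out prefs)
X = rng.uniform(0, 1, size=(200, 3))
y = 0.6 + 0.2 * X[:, 0] * X[:, 2] + rng.normal(0, 0.02, 200)  # toy relationship

performance_predictor = GradientBoostingRegressor().fit(X, y)

# Route: score candidate mixtures and keep the one with the best predicted RM accuracy.
candidates = [featurize(f, 0.5, 0.7) for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
best = max(candidates, key=lambda c: performance_predictor.predict([c])[0])
print("selected human-annotation fraction:", best[0])
```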
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic.
In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
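As a rough illustration of how a soft preference label can enter a DPO-style objective, the sketch below scales the usual log-likelihood-ratio margin by the soft label, which corresponds to a weighted geometric average of the two likelihood ratios in probability space; it is an illustration in the spirit of the summary above, not the paper's exact loss.

```python
# Illustrative sketch: a DPO-style loss that accepts soft preference labels by
# scaling the log-likelihood-ratio margin, i.e. a weighted geometric average of
# the two likelihood ratios in log space. Not the paper's exact objective.
import torch
import torch.nn.functional as F

def soft_label_dpo_loss(policy_logp_1, policy_logp_2,
                        ref_logp_1, ref_logp_2,
                        soft_label, beta=0.1):
    """soft_label in [0, 1]: probability that response 1 is preferred."""
    # Log-likelihood ratios of each response against the frozen reference model.
    ratio_1 = policy_logp_1 - ref_logp_1
    ratio_2 = policy_logp_2 - ref_logp_2
    # Weighting the two log-ratios by (soft_label, 1 - soft_label) and taking
    # the difference collapses to scaling the usual DPO margin; soft_label = 1
    # recovers standard DPO, soft_label = 0.5 gives no learning signal.
    margin = (2.0 * soft_label - 1.0) * (ratio_1 - ratio_2)
    return -F.logsigmoid(beta * margin).mean()

# Toy batch of 3 pairs (log-probabilities would come from the two models).
p1 = torch.tensor([-12.0, -15.0, -9.0]); p2 = torch.tensor([-14.0, -13.0, -9.5])
r1 = torch.tensor([-13.0, -15.0, -10.0]); r2 = torch.tensor([-13.5, -13.2, -9.8])
labels = torch.tensor([0.9, 0.6, 0.5])  # soft preference degrees
print(soft_label_dpo_loss(p1, p2, r1, r2, labels))
```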
arXiv Detail & Related papers (2024-09-10T17:54:28Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than to less preferred outputs.
We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
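The ranking-accuracy metric described above can be computed directly from pairwise preference data and a model's sequence log-likelihoods; a minimal sketch follows, with the scoring function left abstract (a real implementation would sum the model's token log-probabilities for each response).

```python
# Minimal sketch of the ranking-accuracy metric: the fraction of preference
# pairs for which the model assigns a higher log-likelihood to the chosen
# response than to the rejected one. The scorer is left abstract; a real one
# would sum the model's token log-probabilities.
from typing import Callable, Sequence, Tuple

def ranking_accuracy(pairs: Sequence[Tuple[str, str, str]],
                     logprob: Callable[[str, str], float]) -> float:
    """pairs: (prompt, chosen, rejected); logprob(prompt, response) -> float."""
    hits = sum(logprob(prompt, chosen) > logprob(prompt, rejected)
               for prompt, chosen, rejected in pairs)
    return hits / len(pairs)

# Toy usage with a stand-in scorer (hypothetical; a real one would query the LLM).
toy_pairs = [("2+2?", "It equals 4.", "5"),
             ("Capital of France?", "Paris", "Marseille")]
toy_logprob = lambda prompt, response: -0.5 * len(response)  # placeholder only
print(ranking_accuracy(toy_pairs, toy_logprob))
```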
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
- Do Large Language Models Learn Human-Like Strategic Preferences? [0.0]
We examine whether LLMs learn to make human-like preference judgements in strategic scenarios.
Solar and Mistral are shown to exhibit stable value-based preferences consistent with humans.
arXiv Detail & Related papers (2024-04-11T19:13:24Z)
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators [59.48172585509628]
We propose a simple regression analysis approach for controlling biases in auto-evaluations.
As a real case study, we focus on reducing the length bias of AlpacaEval, a benchmark for chat LLMs.
We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?"
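A much-simplified sketch of the idea: fit a logistic model of "model beats baseline" on the length difference and read off the predicted win rate at zero length difference, i.e., the counterfactual in the quote above. The real length-controlled AlpacaEval fits a richer regression with additional terms; the data below are synthetic.

```python
# Simplified sketch of regression-based length debiasing: fit a logistic model
# of "model beats baseline" on the length difference, then report the predicted
# win rate at zero length difference (the counterfactual above). The real
# length-controlled AlpacaEval fits a richer regression; data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000

# Synthetic per-instruction data: the model tends to answer longer than the
# baseline (positive mean length difference), and the judge rewards length.
len_diff = rng.normal(0.8, 1.0, n)           # standardized length difference
true_quality_edge = 0.3                      # hypothetical quality advantage
win = (true_quality_edge + 0.8 * len_diff + rng.logistic(0, 1, n) > 0).astype(int)

glm = LogisticRegression().fit(len_diff.reshape(-1, 1), win)

raw_win_rate = win.mean()
length_controlled = glm.predict_proba([[0.0]])[0, 1]  # counterfactual: equal lengths
print(f"raw win rate: {raw_win_rate:.3f}  length-controlled: {length_controlled:.3f}")
```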
arXiv Detail & Related papers (2024-04-06T02:29:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.