Dissecting Human and LLM Preferences
- URL: http://arxiv.org/abs/2402.11296v1
- Date: Sat, 17 Feb 2024 14:34:31 GMT
- Title: Dissecting Human and LLM Preferences
- Authors: Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
- Abstract summary: We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. In contrast, advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. We also show that preference-based evaluation can be intentionally manipulated.
- Score: 80.55271307662365
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As a relative quality comparison of model responses, human and Large Language
Model (LLM) preferences serve as common alignment goals in model fine-tuning
and criteria in evaluation. Yet, these preferences merely reflect broad
tendencies, resulting in less explainable and controllable models with
potential safety risks. In this work, we dissect the preferences of humans and
32 different LLMs to understand their quantitative composition, using
annotations from real-world user-model conversations for a fine-grained,
scenario-wise analysis. We find that humans are less sensitive to errors, favor
responses that support their stances, and show clear dislike when models admit
their limits. On the contrary, advanced LLMs like GPT-4-Turbo emphasize
correctness, clarity, and harmlessness more. Additionally, LLMs of similar
sizes tend to exhibit similar preferences, regardless of their training
methods, and fine-tuning for alignment does not significantly alter the
preferences of pretrained-only LLMs. Finally, we show that preference-based
evaluation can be intentionally manipulated. In both training-free and
training-based settings, aligning a model with the preferences of judges boosts
scores, while injecting the least preferred properties lowers them. This
results in notable score shifts: up to 0.59 on MT-Bench (1-10 scale) and 31.94
on AlpacaEval 2.0 (0-100 scale), highlighting the significant impact of this
strategic adaptation.
Interactive Demo: https://huggingface.co/spaces/GAIR/Preference-Dissection-Visualization
Dataset: https://huggingface.co/datasets/GAIR/preference-dissection
Code: https://github.com/GAIR-NLP/Preference-Dissection
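Below is a minimal sketch of how the released preference annotations might be loaded with the Hugging Face `datasets` library. The repository id comes from the Dataset link above; the split name and the inspected fields are assumptions, not the dataset's documented schema.

```python
# Minimal sketch: load the released preference-dissection data and peek at one record.
# The repository id is taken from the paper's links; the split name and printed fields
# are assumptions, not the dataset's documented schema.
from datasets import load_dataset

dataset = load_dataset("GAIR/preference-dissection", split="train")  # split name assumed

example = dataset[0]
print(sorted(example.keys()))  # inspect the available annotation fields
print(example)                 # one annotated user-model conversation record
```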
Related papers
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge contained in the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality in the generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We show that most preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fixing even mild ranking errors.
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
- Do Large Language Models Learn Human-Like Strategic Preferences? [0.0]
We show that Solar and Mistral exhibit stable value-based preferences consistent with humans in the prisoner's dilemma.
We establish a relationship between model size, value-based preferences, and superficiality.
arXiv Detail & Related papers (2024-04-11T19:13:24Z)
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators [59.48172585509628]
We propose a simple regression analysis approach for controlling biases in auto-evaluations.
As a real case study, we focus on reducing the length bias of AlpacaEval, a benchmark for chat LLMs.
We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" (a toy regression sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-04-06T02:29:02Z)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
- On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
We investigate the impact of diversified preferences on reward modeling.
We find that diversified preference data negatively affect the calibration performance of reward models.
We propose a novel Multi-Objective Reward learning method to enhance the calibration performance of RMs on shared preferences.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
- Compositional preference models for aligning LMs [15.036426712762147]
Compositional Preference Models (CPMs) are a framework that decomposes one global preference assessment into several interpretable features.
CPMs make it possible to control which properties of the preference data are used to train the preference model, and to build it from features believed to underlie human preference judgments (a minimal sketch follows this list).
arXiv Detail & Related papers (2023-10-17T01:31:59Z)
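To make the compositional-preference idea in the last entry concrete, here is a minimal sketch: a global preference score is decomposed into a few interpretable feature scores that are combined with fixed weights. The feature names, heuristics, and weights are illustrative assumptions, not the CPM paper's actual features or training procedure.

```python
# Illustrative sketch of a compositional preference score: combine interpretable,
# per-feature scores with fixed weights instead of one opaque global judgment.
# Feature names, heuristics, and weights are assumptions for illustration only;
# the CPM paper derives feature scores from LM judgments and learns how to combine them.
from typing import Callable, Dict

def correctness(response: str) -> float:
    # Placeholder heuristic; a real system would query a judge model.
    return 0.0 if "I don't know" in response else 1.0

def clarity(response: str) -> float:
    # Short average sentence length as a crude proxy for clarity.
    sentences = [s for s in response.split(".") if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    return 1.0 if avg_len <= 25 else 0.5

FEATURES: Dict[str, Callable[[str], float]] = {"correctness": correctness, "clarity": clarity}
WEIGHTS = {"correctness": 0.7, "clarity": 0.3}  # assumed weights; CPMs learn these

def preference_score(response: str) -> float:
    return sum(WEIGHTS[name] * fn(response) for name, fn in FEATURES.items())

better = max(["Paris is the capital of France.", "I don't know."], key=preference_score)
print(better)
```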
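The Length-Controlled AlpacaEval entry above describes a regression approach to length bias. A toy sketch of the general idea: fit a logistic regression of the judge's preference on the length difference between the two outputs, then read off the predicted preference at a length difference of zero as the "same length" counterfactual. The synthetic data and the single-covariate model here are assumptions for illustration, not the benchmark's actual regression.

```python
# Toy sketch of length-debiased preference estimation: regress the judge's preference
# on the length difference between the two outputs, then predict the preference at a
# length difference of zero ("what if both outputs had the same length?").
# The synthetic data and this single-covariate model are illustrative assumptions,
# not AlpacaEval's actual length-controlled regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
length_diff = rng.normal(0.0, 1.0, size=n)     # (model length - baseline length), standardized
true_quality_gap = 0.3                         # assumed latent quality advantage of the model
logits = true_quality_gap + 1.2 * length_diff  # judge is biased toward longer outputs
preferred = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))

clf = LogisticRegression().fit(length_diff.reshape(-1, 1), preferred)

raw_win_rate = preferred.mean()
length_controlled_win_rate = clf.predict_proba([[0.0]])[0, 1]
print(f"raw win rate:               {raw_win_rate:.3f}")
print(f"length-controlled estimate: {length_controlled_win_rate:.3f}")
```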