Related papers: Do Large Language Models Learn Human-Like Strategic Preferences?

Do Large Language Models Learn Human-Like Strategic Preferences?

URL: http://arxiv.org/abs/2404.08710v1
Date: Thu, 11 Apr 2024 19:13:24 GMT
Title: Do Large Language Models Learn Human-Like Strategic Preferences?
Authors: Jesse Roberts, Kyle Moore, Doug Fisher,
Abstract summary: We show that Solar and Mistral exhibit stable value-based preference consistent with human in the prisoner's dilemma. We establish a relationship between model size, value based preference, and superficiality.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We evaluate whether LLMs learn to make human-like preference judgements in strategic scenarios as compared with known empirical results. We show that Solar and Mistral exhibit stable value-based preference consistent with human in the prisoner's dilemma, including stake-size effect, and traveler's dilemma, including penalty-size effect. We establish a relationship between model size, value based preference, and superficiality. Finally, we find that models that tend to be less brittle were trained with sliding window attention. Additionally, we contribute a novel method for constructing preference relations from arbitrary LLMs and support for a hypothesis regarding human behavior in the traveler's dilemma.

Related papers

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality. We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations. We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are in opposition with standard reward modeling approaches. We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
Uncovering Factor Level Preferences to Improve Human-Model Alignment [58.50191593880829]
We introduce PROFILE, a framework that uncovers and quantifies the influence of specific factors driving preferences. ProFILE's factor level analysis explains the 'why' behind human-model alignment and misalignment. We demonstrate how leveraging factor level insights, including addressing misaligned factors, can improve alignment with human preferences.
arXiv Detail & Related papers (2024-10-09T15:02:34Z)
A Survey on Human Preference Learning for Large Language Models [81.41868485811625]
The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning. This survey covers the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs.
arXiv Detail & Related papers (2024-06-17T03:52:51Z)
Using LLMs to Model the Beliefs and Preferences of Targeted Populations [4.0849074543032105]
We consider the problem of aligning a large language model (LLM) to model the preferences of a human population. Modeling the beliefs, preferences, and behaviors of a specific population can be useful for a variety of different applications.
arXiv Detail & Related papers (2024-03-29T15:58:46Z)
Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback. We term this approach Nash learning from human feedback (NLHF) We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
A density estimation perspective on learning from pairwise human preferences [32.64330423345252]
We show that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. We discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models.
arXiv Detail & Related papers (2023-11-23T17:20:36Z)
Can LLMs Capture Human Preferences? [5.683832910692926]
We explore the viability of Large Language Models (LLMs) in emulating human survey respondents and eliciting preferences. We compare responses from LLMs across various languages and compare them to human responses, exploring preferences between smaller, sooner, and larger, later rewards. Our findings reveal that both GPT models demonstrate less patience than humans, with GPT-3.5 exhibiting a lexicographic preference for earlier rewards, unlike human decision-makers.
arXiv Detail & Related papers (2023-05-04T03:51:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.