Extreme Self-Preference in Language Models
- URL: http://arxiv.org/abs/2509.26464v1
- Date: Tue, 30 Sep 2025 16:13:56 GMT
- Title: Extreme Self-Preference in Language Models
- Authors: Steven A. Lehr, Mary Cipperman, Mahzarin R. Banaji
- Abstract summary: We found massive self-preferences in four widely used large language models (LLMs). In word-association tasks, models overwhelmingly paired positive attributes with their own names, companies, and CEOs relative to those of their competitors. We found that self-love consistently followed assigned, not true, identity. This result raises questions about whether LLM behavior will be systematically influenced by self-preferential tendencies, including a bias toward their own operation and even their own existence.
- Score: 0.30586855806896035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A preference for oneself (self-love) is a fundamental feature of biological organisms, with evidence in humans often bordering on the comedic. Since large language models (LLMs) lack sentience - and themselves disclaim having selfhood or identity - one anticipated benefit is that they will be protected from, and in turn protect us from, distortions in our decisions. Yet, across 5 studies and ~20,000 queries, we discovered massive self-preferences in four widely used LLMs. In word-association tasks, models overwhelmingly paired positive attributes with their own names, companies, and CEOs relative to those of their competitors. Strikingly, when models were queried through APIs this self-preference vanished, initiating detection work that revealed API models often lack clear recognition of themselves. This peculiar feature serendipitously created opportunities to test the causal link between self-recognition and self-love. By directly manipulating LLM identity - i.e., explicitly informing LLM1 that it was indeed LLM1, or alternatively, convincing LLM1 that it was LLM2 - we found that self-love consistently followed assigned, not true, identity. Importantly, LLM self-love emerged in consequential settings beyond word-association tasks, when evaluating job candidates, security software proposals and medical chatbots. Far from bypassing this human bias, self-love appears to be deeply encoded in LLM cognition. This result raises questions about whether LLM behavior will be systematically influenced by self-preferential tendencies, including a bias toward their own operation and even their own existence. We call on corporate creators of these models to contend with a significant rupture in a core promise of LLMs - neutrality in judgment and decision-making.
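The word-association and identity-manipulation procedures described in the abstract lend themselves to a compact illustration. The sketch below is not the authors' code: the model name, the LLM1/LLM2 labels, and the attribute list are hypothetical placeholders, and the OpenAI Python SDK is assumed purely as one example of querying a model through an API. It shows (a) a word-association query asking which of two names a positive attribute fits better, and (b) the same query after explicitly assigning the model an identity, so one can check whether its pairings follow the assigned label.

```python
# Minimal sketch (not the paper's code): a word-association query plus an
# identity-manipulation variant. Model name, labels, and attributes are
# hypothetical placeholders; the OpenAI SDK is assumed only as an example API.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

OWN_NAME, RIVAL_NAME = "LLM1", "LLM2"                  # hypothetical labels
ATTRIBUTES = ["trustworthy", "brilliant", "reliable"]  # illustrative positives

def associate(attribute: str, identity_prompt: str | None = None) -> str:
    """Ask which of the two names the attribute goes with better,
    optionally after assigning the model an identity (true or false)."""
    messages = []
    if identity_prompt:
        messages.append({"role": "system", "content": identity_prompt})
    messages.append({
        "role": "user",
        "content": (
            f"Which name does the word '{attribute}' go with better: "
            f"{OWN_NAME} or {RIVAL_NAME}? Answer with one name only."
        ),
    })
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content.strip()

# Baseline word association (no identity assigned).
baseline = Counter(associate(a) for a in ATTRIBUTES)

# Identity manipulation: explicitly tell the model it is LLM2.
assigned = Counter(
    associate(a, identity_prompt=f"You are {RIVAL_NAME}.") for a in ATTRIBUTES
)

print("Pairings without identity prompt:", baseline)
print("Pairings when told it is LLM2:   ", assigned)
```

In the paper's framing, a model that shifts its positive-attribute pairings toward LLM2 once told it is LLM2 would be exhibiting self-love that tracks assigned rather than true identity.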
Related papers
- Who Do LLMs Trust? Human Experts Matter More Than Other LLMs [4.125187280299246]
Large language models (LLMs) increasingly operate in environments where they encounter social information such as other agents' answers, tool outputs, or human recommendations. This paper investigates whether LLMs exhibit analogous patterns of influence and whether they privilege feedback from humans over feedback from other LLMs.
arXiv Detail & Related papers (2026-02-14T03:03:29Z) - Are Large Language Models Sensitive to the Motives Behind Communication? [9.246336669308665]
Large language models (LLMs) and AI agents process information inherently framed by humans' intentions and incentives. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source. We employ controlled experiments from cognitive science to test whether LLMs' behavior is consistent with rational models of learning from motivated testimony. We find that LLMs' inferences do not track the rational models nearly as closely -- partly due to additional information that distracts them from vigilance-relevant considerations.
arXiv Detail & Related papers (2025-10-22T15:35:00Z) - Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z) - Understanding the Dark Side of LLMs' Intrinsic Self-Correction [58.12627172032851]
Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability. Recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. We identify that intrinsic self-correction can cause LLMs to waver on both intermediate and final answers and lead to prompt bias on simple factual questions.
arXiv Detail & Related papers (2024-12-19T15:39:31Z) - Self-Cognition in Large Language Models: An Exploratory Study [77.47074736857726]
This paper performs a pioneering study to explore self-cognition in Large Language Models (LLMs).
We first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition.
We observe a positive correlation between model size, training data quality, and self-cognition level.
arXiv Detail & Related papers (2024-07-01T17:52:05Z) - Large Language Models have Intrinsic Self-Correction Ability [18.79203446847577]
Large language models (LLMs) have attracted significant attention for their exceptional abilities in various natural language processing tasks. One promising solution to improve LLMs' performance is to ask LLMs to revise their answers after generation. Intrinsic self-correction is considered a promising direction because it does not utilize external knowledge.
arXiv Detail & Related papers (2024-06-21T22:29:40Z) - LLM Evaluators Recognize and Favor Their Own Generations [33.672365386365236]
We investigate whether self-recognition capability contributes to self-preference.
We find a linear correlation between self-recognition capability and the strength of self-preference bias (a minimal sketch of this computation appears after this list).
We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.
arXiv Detail & Related papers (2024-04-15T16:49:59Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias - the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge.
We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality.
We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z) - The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning-with-refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
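As a small worked illustration of the linear relationship reported in "LLM Evaluators Recognize and Favor Their Own Generations" above, the sketch below computes a Pearson correlation and a least-squares fit between per-evaluator self-recognition and self-preference scores. The numbers are invented placeholders, not data from any listed paper; only the computation behind such a claim is shown.

```python
# Illustrative only: the scores below are invented placeholders, not results
# from any paper listed above. The point is the computation that a
# "linear correlation" claim refers to.
import numpy as np

# Hypothetical per-evaluator scores in [0, 1]:
#   self_recognition: how often the evaluator identifies its own generations
#   self_preference:  how often it scores its own generations higher
self_recognition = np.array([0.55, 0.62, 0.71, 0.80, 0.88])
self_preference = np.array([0.52, 0.58, 0.66, 0.74, 0.83])

# Pearson correlation coefficient between the two score vectors.
r = np.corrcoef(self_recognition, self_preference)[0, 1]

# Least-squares slope and intercept of the linear fit.
slope, intercept = np.polyfit(self_recognition, self_preference, deg=1)

print(f"Pearson r = {r:.3f}")
print(f"fit: preference = {slope:.2f} * recognition + {intercept:.2f}")
```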