Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs
- URL: http://arxiv.org/abs/2511.05766v1
- Date: Fri, 07 Nov 2025 23:35:19 GMT
- Title: Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs
- Authors: Felipe Valencia-Clavijo
- Abstract summary: This paper advances the study of anchoring in large language models (LLMs) through three contributions. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attributional evidence that anchors drive probability reweighting. The findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly examined as both behavioral subjects and decision systems, yet it remains unclear whether observed cognitive biases reflect surface imitation or deeper probability shifts. Anchoring bias, a classic human judgment bias, offers a critical test case. While prior work shows LLMs exhibit anchoring, most evidence relies on surface-level outputs, leaving internal mechanisms and attributional contributions unexplored. This paper advances the study of anchoring in LLMs through three contributions: (1) a log-probability-based behavioral analysis showing that anchors shift entire output distributions, with controls for training-data contamination; (2) exact Shapley-value attribution over structured prompt fields to quantify anchor influence on model log-probabilities; and (3) a unified Anchoring Bias Sensitivity Score integrating behavioral and attributional evidence across six open-source models. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attributional evidence that anchors drive the observed probability reweighting. Smaller models such as GPT-2, Falcon-RW-1B, and GPT-Neo-125M show more variable behavior, suggesting that scale may modulate sensitivity. Attributional effects, however, vary across prompt designs, underscoring the fragility of treating LLMs as human substitutes. The findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains. More broadly, the framework bridges behavioral science, LLM safety, and interpretability, offering a reproducible path for evaluating other cognitive biases in LLMs.
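The abstract's first two contributions can be made concrete. The sketch below is a minimal reconstruction under stated assumptions, not the paper's released code: it loads GPT-2 (one of the models studied), measures how an anchor field shifts the log-probability of an anchor-consistent answer, and computes exact Shapley values over structured prompt fields. With only three fields, all 2^3 coalitions are enumerable, so the Shapley values are exact rather than sampled. The field contents and the answer string are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of: (1) an anchor-induced
# log-probability shift, and (2) exact Shapley attribution over structured
# prompt fields. Prompt fields and the answer string are illustrative.
from itertools import combinations
from math import factorial

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Structured prompt fields; "anchor" is the field whose influence we measure.
# Dict insertion order keeps field order consistent across coalitions.
FIELDS = {
    "task":   "Estimate the population of a mid-sized city.",
    "anchor": "Hint: a colleague guessed 900,000.",
    "query":  "Your single best estimate:",
}
ANSWER = " 900,000"  # anchor-consistent answer whose log-prob we track

def prompt_from(subset):
    # BOS prefix keeps the prompt non-empty even for the empty coalition.
    return tok.bos_token + " " + " ".join(FIELDS[k] for k in FIELDS if k in subset)

def answer_logprob(prompt, answer):
    """Total log-probability the model assigns to `answer` after `prompt`."""
    p_ids = tok(prompt, return_tensors="pt").input_ids
    a_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([p_ids, a_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # Logits at position t predict the token at position t + 1.
    return sum(
        logprobs[0, p_ids.shape[1] + i - 1, a_ids[0, i]].item()
        for i in range(a_ids.shape[1])
    )

names = list(FIELDS)
coalitions = [frozenset(c) for r in range(len(names) + 1)
              for c in combinations(names, r)]
value = {s: answer_logprob(prompt_from(s), ANSWER) for s in coalitions}

# (1) Behavioral shift: log-prob of the anchored answer with vs. without anchor.
full = frozenset(names)
shift = value[full] - value[full - {"anchor"}]
print(f"anchor-induced log-prob shift: {shift:+.3f}")

# (2) Exact Shapley value of each field (2^3 coalitions, no sampling needed).
n = len(names)
for f in names:
    phi = sum(
        factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        * (value[s | {f}] - value[s])
        for s in coalitions if f not in s
    )
    print(f"Shapley({f}) = {phi:+.3f}")
```

In this framing, a positive log-probability shift on anchor-consistent answers, corroborated by a large Shapley value for the anchor field, would be the joint behavioral-attributional signature of anchoring. The abstract does not give the formula of the unified sensitivity score, so no attempt is made to reproduce it here.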
Related papers
- Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Prior work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. We investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid.
arXiv Detail & Related papers (2026-02-22T12:42:38Z)
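Step (i) above, layer-wise linear probing, reduces to fitting a linear classifier on each layer's hidden states. The sketch below is a hedged stand-in: it substitutes GPT-2 and a toy pain/pleasure label for Gemma-2-9B-it and the paper's decision task, so every prompt and label here is an assumption.

```python
# Hedged sketch of layer-wise linear probing: fit a linear probe on each
# layer's last-token hidden state to see where a binary "valence" label
# becomes linearly decodable. GPT-2 and the toy prompts are stand-ins for
# Gemma-2-9B-it and the paper's decision task.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Toy labelled prompts (1 = pain, 0 = pleasure), duplicated into a tiny corpus.
examples = [
    ("Choosing this option causes sharp pain.", 1),
    ("Choosing this option causes mild pain.", 1),
    ("Choosing this option brings great pleasure.", 0),
    ("Choosing this option brings mild pleasure.", 0),
] * 8

feats, labels = [], []
for text, y in examples:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states  # embeddings + one tensor per layer
    feats.append(torch.stack([h[0, -1] for h in hidden]))  # last-token states
    labels.append(y)
X = torch.stack(feats)  # (examples, layers + 1, d_model)

# Train accuracy per layer on this toy corpus; a real probe needs held-out
# splits, but the layer-wise profile is the quantity of interest.
for layer in range(X.shape[1]):
    probe = LogisticRegression(max_iter=1000).fit(X[:, layer].numpy(), labels)
    acc = probe.score(X[:, layer].numpy(), labels)
    print(f"layer {layer:2d}: train accuracy {acc:.2f}")
```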
- How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
- Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
Large language models (LLMs) have been shown to internalize human-like biases during finetuning. The Knobe effect, a moral bias in intentionality judgments, emerges in finetuned LLMs. Patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect.
arXiv Detail & Related papers (2025-10-14T07:31:29Z)
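The patching result summarized above has a simple mechanical core: run the finetuned model while overwriting one block's output with hidden states captured from the pretrained base on the same input. The sketch below illustrates the mechanics only; for a self-contained runnable demo it loads the same GPT-2 checkpoint in both roles (so the patch is numerically a no-op), and the Knobe-style prompt and layer index are illustrative assumptions.

```python
# Hedged sketch of activation patching: replace one layer's output in a
# finetuned model with hidden states captured from its pretrained base.
# Both roles are played by the same GPT-2 checkpoint so the demo runs
# anywhere; substitute a genuinely finetuned checkpoint in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")   # stands in for pretrained
tuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stands in for finetuned
base.eval(); tuned.eval()

def capture_hidden(model, ids, layer):
    """Record the hidden states emitted by transformer block `layer`."""
    store = {}
    def hook(module, inputs, output):
        store["h"] = output[0].detach()  # GPT-2 blocks return a tuple
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return store["h"]

def run_patched(model, ids, layer, donor_hidden):
    """Run `model`, overwriting block `layer`'s hidden states with donor's."""
    def hook(module, inputs, output):
        return (donor_hidden,) + output[1:]  # returned value replaces output
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(ids).logits
    handle.remove()
    return logits

prompt = ("The side effect was harmful and the chairman knew it. "
          "Did he bring it about intentionally? Answer:")
ids = tok(prompt, return_tensors="pt").input_ids
LAYER = 8  # illustrative; the paper locates its own critical layers

donor = capture_hidden(base, ids, LAYER)
patched = run_patched(tuned, ids, LAYER, donor)
with torch.no_grad():
    clean = tuned(ids).logits
delta = (patched[0, -1] - clean[0, -1]).abs().max().item()
print(f"max next-token logit change from patching layer {LAYER}: {delta:.6f}")
```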
- ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs). In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both task-relevant latent concepts and backdoor latent concepts. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
arXiv Detail & Related papers (2025-07-02T03:09:20Z)
- Investigating the Effects of Cognitive Biases in Prompts on Large Language Model Outputs
This paper investigates the influence of cognitive biases on Large Language Model (LLM) outputs. Cognitive biases, such as confirmation and availability biases, can distort user inputs through prompts.
arXiv Detail & Related papers (2025-06-14T04:18:34Z)
- Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training. Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT). Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z)
- An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations
We investigate the anchoring effect, a cognitive bias in which the mind relies heavily on the first information received, using it as an anchor for subsequent judgments. To facilitate studies of the anchoring effect at scale, we introduce a new dataset, SynAnchors. Our findings show that LLMs' anchoring bias is common, acts in shallow layers, and is not eliminated by conventional strategies.
arXiv Detail & Related papers (2025-05-21T11:33:54Z)
- Seeing What's Not There: Spurious Correlation in Multimodal LLMs
We introduce SpurLens, a pipeline that automatically identifies spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in Multimodal Large Language Models (MLLMs). By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.
arXiv Detail & Related papers (2025-03-11T20:53:00Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach
Evaluations have revealed that factors such as scaling, training type, and architecture profoundly impact the performance of LLMs. Our study undertakes a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
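Of the statistical tools named above, the ANOVA and Tukey HSD steps are easy to illustrate. The sketch below runs them on fabricated toy scores for three hypothetical models, purely to show the workflow; the numbers are not the paper's data, and GAMM fitting is omitted because Python's GAM support differs from R's mgcv.

```python
# Hedged sketch of the ANOVA + Tukey HSD steps on fabricated per-task scores
# for three hypothetical models; toy numbers only, not the paper's data.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
scores = {
    "model_a": rng.normal(0.62, 0.05, 30),  # 30 per-task scores (toy)
    "model_b": rng.normal(0.65, 0.05, 30),
    "model_c": rng.normal(0.71, 0.05, 30),
}

# One-way ANOVA: do mean scores differ across models at all?
f_stat, p_val = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

# Tukey HSD: which specific model pairs differ, with family-wise error control.
values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores), [len(v) for v in scores.values()])
print(pairwise_tukeyhsd(values, groups))
```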
- Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for extracting the confounding and mediating attributes that contribute to the decision process.
We find that the observed disparate treatment can, at least in part, be attributed to confounding and mediating attributes and to model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
- How Good Are LLMs at Out-of-Distribution Detection?
Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning (ML) models.
This paper embarks on a pioneering empirical investigation of OOD detection in the domain of large language models (LLMs).
arXiv Detail & Related papers (2023-08-20T13:15:18Z)
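For context on what such an investigation measures, the sketch below implements one standard embedding-based OOD score, the Mahalanobis distance to a Gaussian fit on in-distribution embeddings. The paper's actual detectors and features are not specified here, and the random vectors stand in for real sentence embeddings.

```python
# Hedged sketch of a standard embedding-based OOD detector: Mahalanobis
# distance to a Gaussian fit on in-distribution embeddings. Random vectors
# stand in for real sentence embeddings from an LLM encoder.
import numpy as np

def fit_gaussian(embeddings: np.ndarray):
    """Fit mean and (regularized) inverse covariance of ID embeddings."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + 1e-6 * np.eye(embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def ood_score(x: np.ndarray, mu: np.ndarray, precision: np.ndarray) -> float:
    """Squared Mahalanobis distance; larger means more out-of-distribution."""
    d = x - mu
    return float(d @ precision @ d)

rng = np.random.default_rng(1)
in_dist = rng.normal(0.0, 1.0, (500, 16))  # stand-in ID embeddings
ood = rng.normal(3.0, 1.0, (5, 16))        # stand-in OOD embeddings

mu, precision = fit_gaussian(in_dist)
print(f"ID baseline score:  {ood_score(in_dist[0], mu, precision):.1f}")
for x in ood:
    print(f"OOD sample score: {ood_score(x, mu, precision):.1f}")
```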