Are Humans as Brittle as Large Language Models?
- URL: http://arxiv.org/abs/2509.07869v2
- Date: Fri, 07 Nov 2025 16:21:31 GMT
- Title: Are Humans as Brittle as Large Language Models?
- Authors: Jiahui Li, Sean Papay, Roman Klinger
- Abstract summary: We compare the effects of prompt modifications on large language models (LLMs) and identical instruction modifications for human annotators. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications.
- Score: 9.467418013202282
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The output of large language models (LLMs) is unstable, due both to the non-determinism of the decoding process and to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.
Related papers
- Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection [5.731621080995591]
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement.
arXiv Detail & Related papers (2025-12-10T14:00:48Z)
- Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
arXiv Detail & Related papers (2025-09-10T17:58:53Z)
- Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs [34.51801559719707]
High prompt sensitivity has been widely accepted as a core limitation of large language models. This work asks: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? We find that much of the prompt sensitivity stems from evaluation methods, including log-likelihood scoring and rigid answer matching.
arXiv Detail & Related papers (2025-09-01T21:38:28Z)
- Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z)
- Do LLMs write like humans? Variation in grammatical and rhetorical styles [0.6303112417588329]
Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text.
arXiv Detail & Related papers (2024-10-21T15:35:44Z)
- Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear.
By including additional context in prompts, we analyze LLM's sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z)
- One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations [47.669923625184644]
Large Language Models (LLMs) are nondeterministic: the same input can generate different outputs, some of which may be incorrect or hallucinated.
This study investigates how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs.
arXiv Detail & Related papers (2024-05-09T07:12:45Z)
- Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)