Everyone prefers human writers, including AI
- URL: http://arxiv.org/abs/2510.08831v1
- Date: Thu, 09 Oct 2025 21:33:30 GMT
- Title: Everyone prefers human writers, including AI
- Authors: Wouter Haverals, Meredith Martin
- Abstract summary: We conducted controlled experiments using Raymond Queneau's Exercises in Style (1947) to measure attribution bias. Humans showed a +13.7 percentage point (pp) bias (Cohen's h = 0.28, 95% CI: 0.21-0.34), while AI models showed a +34.3 pp bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau's Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14$\times$14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen's h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P$<$0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as "AI-generated" regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.
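The headline effect sizes above are Cohen's h, which measures the gap between two proportions on an arcsine-transformed scale. A minimal sketch of the arithmetic is below; the proportions are illustrative stand-ins chosen only so the gap matches the reported +13.7 pp, not values taken from the paper.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions,
    computed on the arcsine-transformed scale."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Illustrative proportions only (not reported in the paper): a 13.7 pp gap,
# e.g. 57.0% preference under a "human" label vs 43.3% under an "AI" label.
p_human_label, p_ai_label = 0.570, 0.433
print(f"gap       = {100 * (p_human_label - p_ai_label):.1f} pp")  # 13.7 pp
print(f"Cohen's h = {cohens_h(p_human_label, p_ai_label):.2f}")    # ~0.28
```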
Related papers
- Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers [8.031052360107092]
It is unclear whether frontier AI models can generate high-quality literary text while emulating authors' styles. We compare MFA-trained expert writers with three frontier AI models (ChatGPT, Claude, and Gemini) writing excerpts of up to 450 words that emulate the diverse styles of 50 award-winning authors.
arXiv Detail & Related papers (2025-10-15T17:51:58Z)
- Divergent Realities: A Comparative Analysis of Human Expert vs. Artificial Intelligence Based Generation and Evaluation of Treatment Plans in Dermatology [0.0]
Evaluating AI-generated treatment plans is a key challenge as AI expands beyond diagnostics. This study compares plans from human experts and two AI models (a generalist and a reasoner), assessed by both human peers and a superior AI judge.
arXiv Detail & Related papers (2025-07-08T06:59:58Z)
- AI Debate Aids Assessment of Controversial Claims [73.8907110799657]
We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial factuality claims. In Study I, debate consistently improves human judgment accuracy and confidence calibration, outperforming consultancy. In Study II, AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%). These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight in contested domains.
arXiv Detail & Related papers (2025-06-02T19:01:53Z)
- Charting the Parrot's Song: A Maximum Mean Discrepancy Approach to Measuring AI Novelty, Originality, and Distinctiveness [0.2209921757303168]
This paper introduces a robust, quantitative methodology to measure distributional differences between generative processes. By comparing entire output distributions rather than conducting pairwise similarity checks, our approach directly contrasts creative processes. This research provides courts and policymakers with a computationally efficient, legally relevant tool to quantify AI novelty.
arXiv Detail & Related papers (2025-04-11T11:15:26Z)
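The core idea of the paper above is a two-sample comparison of whole output distributions via Maximum Mean Discrepancy (MMD). A generic sketch of a biased squared-MMD estimate with an RBF kernel follows; the random embeddings stand in for creative outputs and this is an assumption-laden illustration, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

def mmd2_biased(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X (n x d) and Y (m x d):
    larger values suggest the two generative processes differ in distribution."""
    return (
        rbf_kernel(X, X, gamma).mean()
        + rbf_kernel(Y, Y, gamma).mean()
        - 2 * rbf_kernel(X, Y, gamma).mean()
    )

# Toy usage: embeddings of human- vs AI-produced texts (random stand-ins here).
rng = np.random.default_rng(0)
human_embeddings = rng.normal(0.0, 1.0, size=(200, 16))
ai_embeddings = rng.normal(0.3, 1.0, size=(200, 16))
print(f"MMD^2 ~= {mmd2_biased(human_embeddings, ai_embeddings):.4f}")
```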
- Benchmarking the rationality of AI decision making using the transitivity axiom [0.0]
We evaluate the rationality of AI responses via a series of choice experiments designed to test transitivity of preference in humans. We found that the Llama 2 and 3 models generally satisfied transitivity, but when violations did occur, they occurred only in the Chat/Instruct versions of the LLMs.
arXiv Detail & Related papers (2025-02-14T20:56:40Z)
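The transitivity axiom referenced above requires that a judge who prefers A to B and B to C also prefers A to C. A small sketch of how such violations could be detected from forced-choice data; the helper and the example choices are hypothetical and not the paper's experimental setup.

```python
from itertools import permutations

def transitivity_violations(prefers: dict[tuple[str, str], bool]) -> list[tuple[str, str, str]]:
    """Return all (a, b, c) triples where a beats b and b beats c,
    yet c beats a: an intransitive cycle violating the transitivity axiom."""
    items = {x for pair in prefers for x in pair}
    violations = []
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a)):
            violations.append((a, b, c))
    return violations

# Hypothetical pairwise choices elicited from a model (not data from the paper):
choices = {
    ("A", "B"): True,   # model picked A over B
    ("B", "C"): True,   # model picked B over C
    ("C", "A"): True,   # model picked C over A -> intransitive cycle
}
print(transitivity_violations(choices))  # each rotation of the A-B-C cycle
```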
- Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation. We propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
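One way to read the stratified analysis proposed above is to compute the metric-vs-human correlation separately within bands of human label uncertainty (here taken as the standard deviation across annotators). The bin edges and synthetic data below are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def stratified_correlation(auto_scores, human_means, human_stds,
                           bins=(0.0, 0.5, 1.0, np.inf)):
    """Pearson correlation between automatic scores and mean human labels,
    computed separately within strata of annotator disagreement (std dev)."""
    auto_scores = np.asarray(auto_scores, dtype=float)
    human_means = np.asarray(human_means, dtype=float)
    human_stds = np.asarray(human_stds, dtype=float)
    results = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (human_stds >= lo) & (human_stds < hi)
        if mask.sum() >= 2:
            r = np.corrcoef(auto_scores[mask], human_means[mask])[0, 1]
            results[f"[{lo}, {hi})"] = (round(float(r), 3), int(mask.sum()))
    return results

# Toy usage with synthetic data (not from the paper): the metric is noisier
# exactly where human annotators disagree more.
rng = np.random.default_rng(1)
human_means = rng.uniform(1, 5, size=300)
human_stds = rng.uniform(0, 1.5, size=300)
auto_scores = human_means + rng.normal(0, human_stds)
print(stratified_correlation(auto_scores, human_means, human_stds))
```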
- Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated [48.70176791365903]
This study explores how bias shapes the perception of AI- versus human-generated content. We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z)
- AIGCOIQA2024: Perceptual Quality Assessment of AI Generated Omnidirectional Images [70.42666704072964]
We establish a large-scale AI generated omnidirectional image IQA database named AIGCOIQA2024.
A subjective IQA experiment is conducted to assess human visual preferences from three perspectives.
We conduct a benchmark experiment to evaluate the performance of state-of-the-art IQA models on our database.
arXiv Detail & Related papers (2024-04-01T10:08:23Z)
- Can Machines Imitate Humans? Integrative Turing-like tests for Language and Vision Demonstrate a Narrowing Gap [56.611702960809644]
We benchmark AI's ability to imitate humans in three language tasks and three vision tasks. Next, we conducted 72,191 Turing-like tests with 1,916 human judges and 10 AI judges. Imitation ability showed minimal correlation with conventional AI performance metrics.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the scores produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)