Computational Turing Test Reveals Systematic Differences Between Human and AI Language
- URL: http://arxiv.org/abs/2511.04195v1
- Date: Thu, 06 Nov 2025 08:56:37 GMT
- Title: Computational Turing Test Reveals Systematic Differences Between Human and AI Language
- Authors: Nicolò Pagan, Petter Törnberg, Christopher A. Bail, Anikó Hannák, Christopher Barrie,
- Abstract summary: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior.<n>Existing validation efforts rely heavily on human-judgment-based evaluations.<n>This paper introduces a computational Turing test to assess how closely LLMs approximate human language.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
Related papers
- Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data [54.145424717168794]
Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored.<n>This paper presents a case study to investigate LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years.<n>We propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer.
arXiv Detail & Related papers (2026-03-01T21:46:27Z) - HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals.<n>We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon.<n>We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z) - From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers [14.983442449498739]
This study investigates whether and how Large Language Models can model the correlational structure of human psychological traits from minimal quantitative inputs.<n>We prompted various LLMs with Big Five Personality Scale responses from 816 human individuals to role-play their responses on nine other psychological scales.<n>LLMs demonstrated remarkable accuracy in capturing human psychological structure.
arXiv Detail & Related papers (2025-11-05T06:51:13Z) - TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation [55.55404595177229]
Large Language Models (LLMs) are exhibiting emergent human-like abilities.<n>TwinVoice is a benchmark for assessing persona simulation across diverse real-world contexts.
arXiv Detail & Related papers (2025-10-29T14:00:42Z) - ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability [62.285407189502216]
Detecting texts generated by Large Language Models (LLMs) could cause grave mistakes due to incorrect decisions.<n>We introduce ExaGPT, an interpretable detection approach grounded in the human decision-making process.<n>We show that ExaGPT massively outperforms prior powerful detectors by up to +40.9 points of accuracy at a false positive rate of 1%.
arXiv Detail & Related papers (2025-02-17T01:15:07Z) - A Flexible Method for Behaviorally Measuring Alignment Between Human and Artificial Intelligence Using Representational Similarity Analysis [0.1957338076370071]
We adapted Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans.<n>We tested this approach on semantic alignment across text and image modalities, measuring how different Large Language and Vision Language Model (LLM and VLM) similarity judgments aligned with human responses at both group and individual levels.
arXiv Detail & Related papers (2024-11-30T20:24:52Z) - HLB: Benchmarking LLMs' Humanlikeness in Language Use [2.438748974410787]
We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs)
We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments.
Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
arXiv Detail & Related papers (2024-09-24T09:02:28Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.