Large Language Models Pass the Turing Test
- URL: http://arxiv.org/abs/2503.23674v1
- Date: Mon, 31 Mar 2025 02:37:45 GMT
- Title: Large Language Models Pass the Turing Test
- Authors: Cameron R. Jones, Benjamin K. Bergen
- Abstract summary: We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two Turing tests on independent populations. Results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test.
- Score: 0.913127392774573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans it was compared against -- while the baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). These results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
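The headline numbers above are win rates tested against the 50% chance level. As a minimal sketch of that kind of comparison, assuming a two-sided exact binomial test (the counts below are illustrative placeholders, not the paper's sample sizes):

```python
# Hedged sketch: is a Turing-test win rate significantly different
# from the 50% chance level? Two-sided exact binomial test.
# NOTE: n_games and wins are illustrative placeholders, not study data.
from scipy.stats import binomtest

n_games = 100   # hypothetical number of interrogations
wins = 73       # e.g., a 73% win rate at this hypothetical scale

result = binomtest(wins, n_games, p=0.5, alternative="two-sided")
print(f"win rate = {wins / n_games:.2f}, p-value = {result.pvalue:.4g}")
```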
Related papers
- Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format in which each participant sends only one message per turn.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
- Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed.
In this work, we construct systems, which we call Networks of Networks (NoNs), organized around the distinction between generating a proposed answer and verifying its correctness.
We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-07-23T20:40:37Z)
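As a minimal sketch of the "best-of-K" judge pattern named above, assuming stubbed model calls (the paper's actual systems and prompts are not reproduced here):

```python
# Hedged sketch of a judge-based "best-of-K" compound AI system:
# K generators each propose an answer and a judge/verifier picks one.
# `generate` and `judge` are hypothetical stand-ins for LLM calls.
from typing import Callable, List

def best_of_k(question: str, k: int,
              generate: Callable[[str], str],
              judge: Callable[[str, List[str]], int]) -> str:
    """Draw k candidate answers, return the one the judge selects."""
    candidates = [generate(question) for _ in range(k)]
    best_index = judge(question, candidates)
    return candidates[best_index]
```

The point of the pattern is that verifying a proposed answer is often easier than generating it, so adding generators buys accuracy at the cost of K inference calls.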
- GPT-4 is judged more human than humans in displaced and inverted Turing tests [0.7437224586066946]
Everyday AI detection requires differentiating between people and AI in online conversations.
We measured how well people and large language models can make this distinction using two modified versions of the Turing test: inverted and displaced.
arXiv Detail & Related papers (2024-07-11T20:28:24Z)
- People cannot distinguish GPT-4 from a human in a Turing test [0.913127392774573]
GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%).
Results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected.
arXiv Detail & Related papers (2024-05-09T04:14:09Z)
- How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO [55.25989137825992]
We introduce ECHO, an evaluative framework inspired by the Turing test.
This framework engages the acquaintances of the target individuals to distinguish between human- and machine-generated responses.
We evaluate three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models.
arXiv Detail & Related papers (2024-04-22T08:00:51Z)
- On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial [10.770999939834985]
We analyze the effect of AI-driven persuasion in a controlled, harmless setting.
We found that participants who debated GPT-4 with access to their personal information had 81.7% higher odds of increased agreement with their opponents than participants who debated humans (the odds-ratio arithmetic is sketched after this entry).
arXiv Detail & Related papers (2024-03-21T13:14:40Z)
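A note on the "81.7% higher odds" figure above: it is an odds ratio of 1.817, which is not the same as an 81.7% higher probability of agreement. A minimal sketch of the arithmetic, with invented group counts purely for illustration:

```python
# Hedged sketch: how an odds ratio ("81.7% higher odds" = odds ratio 1.817)
# relates to raw agreement counts. Counts are invented, not the study's data.
def odds(agree: int, total: int) -> float:
    p = agree / total
    return p / (1 - p)

odds_ratio = odds(64, 100) / odds(50, 100)  # hypothetical AI vs. human groups
print(f"odds ratio = {odds_ratio:.3f}")     # ~1.778 with these counts
```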
- Does GPT-4 pass the Turing test? [0.913127392774573]
The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%).
We argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception.
arXiv Detail & Related papers (2023-10-31T06:27:52Z)
- Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z)
- Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap [45.6806234490428]
We benchmark current AIs in their abilities to imitate humans in three language tasks and three vision tasks.
Experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges.
Results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
- Partner Matters! An Empirical Study on Fusing Personas for Personalized Response Selection in Retrieval-Based Chatbots [51.091235903442715]
This paper explores the impact of using personas that describe either the self or the partner speaker on the task of response selection.
Four persona fusion strategies are designed, each assuming that personas interact with contexts or responses in a different way (a minimal sketch follows this entry).
Empirical studies on the Persona-Chat dataset show that the partner personas can improve the accuracy of response selection.
arXiv Detail & Related papers (2021-05-19T10:32:30Z)
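As a minimal sketch of what a persona fusion strategy can look like in retrieval-based response selection (the `score` function and fusion choices are hypothetical placeholders, not the paper's models):

```python
# Hedged sketch of persona fusion for retrieval-based response selection.
# A matcher scores candidate responses against a context; strategies differ
# in whether the persona text is fused with the context or with each
# candidate response. `score` is a hypothetical placeholder.
from typing import List

def score(context: str, response: str) -> float:
    # Placeholder: a real system would use a trained context-response matcher.
    return float(len(set(context.split()) & set(response.split())))

def select_response(context: str, candidates: List[str],
                    persona: str, fuse_with: str = "context") -> str:
    """Pick the best candidate after fusing the persona into one side."""
    def fused_score(response: str) -> float:
        if fuse_with == "context":
            return score(persona + " " + context, response)
        return score(context, persona + " " + response)  # fuse with response
    return max(candidates, key=fused_score)
```

Whether the self or the partner persona is fused, and on which side, is exactly the kind of design choice the four strategies in the paper vary.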