People cannot distinguish GPT-4 from a human in a Turing test
- URL: http://arxiv.org/abs/2405.08007v1
- Date: Thu, 9 May 2024 04:14:09 GMT
- Title: People cannot distinguish GPT-4 from a human in a Turing test
- Authors: Cameron R. Jones, Benjamin K. Bergen
- Abstract summary: GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%).
Results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected.
- Score: 0.913127392774573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.
Related papers
- GPT-4 is judged more human than humans in displaced and inverted Turing tests [0.7437224586066946]
Everyday AI detection requires differentiating between people and AI in online conversations.
We measured how well people and large language models can discriminate using two modified versions of the Turing test: inverted and displaced.
arXiv Detail & Related papers (2024-07-11T20:28:24Z)
- How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO [55.25989137825992]
We introduce ECHO, an evaluative framework inspired by the Turing test.
This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses.
We evaluate three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models.
arXiv Detail & Related papers (2024-04-22T08:00:51Z)
- GPT-4 Understands Discourse at Least as Well as Humans Do [1.3499500088995462]
GPT-4 performs slightly, but not statistically significantly, better than humans given the very high level of human performance.
Both GPT-4 and humans exhibit a strong ability to make inferences about information that is not explicitly stated in a story, a critical test of understanding.
arXiv Detail & Related papers (2024-03-25T21:17:14Z)
- Does GPT-4 pass the Turing test? [0.913127392774573]
The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%).
We argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception.
arXiv Detail & Related papers (2023-10-31T06:27:52Z)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
- ChatGPT: Jack of all trades, master of none [4.693597927153063]
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT).
We examined ChatGPT's capabilities on 25 diverse analytical NLP tasks.
We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses.
arXiv Detail & Related papers (2023-02-21T15:20:37Z)
- Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z)
- The Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies [97.5153823429076]
The benefits, challenges and drawbacks of AI in this field are reviewed.
The use of data augmentation, explainable AI, and the integration of AI with traditional experimental methods are also discussed.
arXiv Detail & Related papers (2022-12-08T23:23:39Z)
- Human or Machine? Turing Tests for Vision and Language [22.110556671410624]
We systematically benchmark current AIs in their abilities to imitate humans.
Experiments involved testing 769 human agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges.
Results reveal that current AIs are not far from being able to impersonate human judges across different genders, ages, and educational levels.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
- EmpBot: A T5-based Empathetic Chatbot focusing on Sentiments [75.11753644302385]
Empathetic conversational agents should not only understand what is being discussed, but also acknowledge the implied feelings of the conversation partner.
We propose a method based on a transformer pretrained language model (T5).
We evaluate our model on the EmpatheticDialogues dataset using both automated metrics and human evaluation.
arXiv Detail & Related papers (2021-10-30T19:04:48Z)
- Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs [90.20235972293801]
Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs.
An inference algorithm is derived to fuse the individual parse graphs (pg) from all robots across multiple views into a joint pg, which affords more effective reasoning and inference capability to overcome the errors originating from a single view.
arXiv Detail & Related papers (2020-04-25T23:02:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.