Human or Machine? Turing Tests for Vision and Language
- URL: http://arxiv.org/abs/2211.13087v1
- Date: Wed, 23 Nov 2022 16:16:52 GMT
- Title: Human or Machine? Turing Tests for Vision and Language
- Authors: Mengmi Zhang, Giorgia Dellaferrera, Ankur Sikarwar, Marcelo
Armendariz, Noga Mudrik, Prachi Agrawal, Spandan Madan, Andrei Barbu, Haochen
Yang, Tanishq Kumar, Meghna Sadwani, Stella Dellaferrera, Michele Pizzochero,
Hanspeter Pfister, Gabriel Kreiman
- Abstract summary: We systematically benchmark current AIs in their abilities to imitate humans.
Experiments involved testing 769 human agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges.
Results reveal that current AIs are not far from being able to impersonate human judges across different genders, ages, and educational levels.
- Score: 22.110556671410624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI algorithms increasingly participate in daily activities that used to be
the sole province of humans, we are inevitably called upon to consider how much
machines are really like us. To address this question, we turn to the Turing
test and systematically benchmark current AIs in their abilities to imitate
humans. We establish a methodology to evaluate humans versus machines in
Turing-like tests and systematically evaluate a representative set of selected
domains, parameters, and variables. The experiments involved testing 769 human
agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges, in
21,570 Turing tests across 6 tasks encompassing vision and language modalities.
Surprisingly, the results reveal that current AIs are not far from being able
to impersonate human judges across different ages, genders, and educational
levels in complex visual and language challenges. In contrast, simple AI judges
outperform human judges in distinguishing human answers from machine answers.
The curated large-scale Turing test datasets introduced here and their
evaluation metrics provide valuable insights to assess whether an agent is
human or not. The proposed formulation to benchmark human imitation ability in
current AIs paves a way for the research community to expand Turing tests to
other research areas and conditions. All source code and data are publicly
available at https://tinyurl.com/8x8nha7p
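As a rough illustration of the kind of evaluation such a benchmark implies (the data layout and function names below are hypothetical, not drawn from the released code), a judge can be scored by how often it correctly labels each answer as human- or machine-generated, and an AI agent by how often it fools that judge:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One Turing-test trial: a judge's verdict on a single answer."""
    true_source: str  # who actually produced the answer: "human" or "machine"
    verdict: str      # the judge's label: "human" or "machine"

def judge_accuracy(trials):
    """Fraction of trials in which the judge names the true source."""
    return sum(t.verdict == t.true_source for t in trials) / len(trials)

def machine_pass_rate(trials):
    """Fraction of machine answers the judge mistakes for human,
    i.e. how often the AI agent 'passes' with this judge."""
    machine = [t for t in trials if t.true_source == "machine"]
    return sum(t.verdict == "human" for t in machine) / len(machine)

# Toy data: three human answers and three machine answers, one judge.
trials = [
    Trial("human", "human"), Trial("human", "machine"), Trial("human", "human"),
    Trial("machine", "human"), Trial("machine", "machine"), Trial("machine", "human"),
]
print(judge_accuracy(trials))     # 0.5   -> this judge is at chance overall
print(machine_pass_rate(trials))  # ~0.67 -> two of three machine answers pass
```

Under this framing, the paper's headline result corresponds to a high machine pass rate against human judges while simple AI judges keep their accuracy high.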
Related papers
- HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation [50.616995671367704]
We present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands.
Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies.
arXiv Detail & Related papers (2024-03-15T17:45:44Z) - Bending the Automation Bias Curve: A Study of Human and AI-based
Decision Making in National Security Contexts [0.0]
We theorize about the relationship between background knowledge about AI and trust in AI, and how these interact with other factors to influence the probability of automation bias.
We test these hypotheses in a preregistered task identification experiment across a representative sample of 9,000 adults in 9 countries with varying levels of AI industry development.
arXiv Detail & Related papers (2023-06-28T18:57:36Z) - Navigates Like Me: Understanding How People Evaluate Human-Like AI in
Video Games [36.96985093527702]
We collect hundreds of crowd-sourced assessments comparing the human-likeness of navigation behavior generated by our agent and baseline AI agents.
Our proposed agent passes a Turing Test, while the baseline agents do not.
This work provides insights into the characteristics that people consider human-like in the context of goal-directed video game navigation.
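One common way to operationalize "passes a Turing Test" in a judgment study like this (a sketch under assumed counts, not the authors' protocol) is to ask whether judges pick the agent's behavior as the human one at a rate statistically indistinguishable from chance, e.g. with an exact binomial test:

```python
from math import comb

def binom_le(successes, n, p=0.5):
    """P(X <= successes) for X ~ Binomial(n, p): the one-sided probability of
    seeing this few 'human' votes if judges were really choosing at rate p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(successes + 1))

# Hypothetical counts: of 200 two-alternative judgments, the agent's navigation
# was picked as the human one 96 times; a baseline agent only 61 times.
print(binom_le(96, 200))  # ~0.31: consistent with chance -> the agent "passes"
print(binom_le(61, 200))  # far below 0.05: judges reliably spot the baseline
```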
arXiv Detail & Related papers (2023-03-02T18:59:04Z) - Human Heuristics for AI-Generated Language Are Flawed [8.465228064780744]
We study how people judge whether verbal self-presentations, one of the most personal and consequential forms of language, were generated by AI.
We experimentally demonstrate that these wordings make human judgment of AI-generated language predictable and manipulable.
We discuss solutions, such as AI accents, to reduce the deceptive potential of language generated by AI.
arXiv Detail & Related papers (2022-06-15T03:18:56Z) - Metaethical Perspectives on 'Benchmarking' AI Ethics [81.65697003067841]
Benchmarks are seen as the cornerstone for measuring technical progress in Artificial Intelligence (AI) research.
An increasingly prominent research area in AI is ethics, which currently has neither an established set of benchmarks nor a commonly accepted way of measuring the 'ethicality' of an AI system.
We argue that it makes more sense to talk about 'values' rather than 'ethics' when considering the possible actions of present and future AI systems.
arXiv Detail & Related papers (2022-04-11T14:36:39Z) - Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z) - A User-Centred Framework for Explainable Artificial Intelligence in
Human-Robot Interaction [70.11080854486953]
We propose a user-centred framework for XAI that focuses on its social-interactive aspect.
The framework aims to provide a structure for interactive XAI solutions designed for non-expert users.
arXiv Detail & Related papers (2021-09-27T09:56:23Z) - Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being.
For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z) - The MineRL BASALT Competition on Learning from Human Feedback [58.17897225617566]
The MineRL BASALT competition aims to spur forward research on this important class of techniques.
We design a suite of four tasks in Minecraft for which we expect it will be hard to write down hardcoded reward functions.
We provide a dataset of human demonstrations on each of the four tasks, as well as an imitation learning baseline.
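For context on what an imitation learning baseline over such demonstrations looks like in its simplest form (a toy behavioral-cloning sketch on made-up data, not the competition's actual baseline), the policy is just supervised regression from states to demonstrated actions:

```python
import numpy as np

# Hypothetical demonstrations: rows of state features paired with the
# continuous action the demonstrator took in that state.
demo_states = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
demo_actions = np.array([[0.2], [0.8], [1.0], [0.5]])

# Behavioral cloning in its simplest form: fit a policy by supervised
# regression from states to demonstrated actions (least-squares, linear).
weights, *_ = np.linalg.lstsq(demo_states, demo_actions, rcond=None)

def policy(state):
    """Imitate the demonstrator: predict the action they would have taken."""
    return np.asarray(state) @ weights

print(policy([0.9, 0.1]))  # close to the actions shown in nearby demo states
```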
arXiv Detail & Related papers (2021-07-05T12:18:17Z) - A Definition and a Test for Human-Level Artificial Intelligence [1.3140673348778702]
Humans can update the action-value function from a verbal description as if they had experienced the states, actions, and corresponding reward sequences firsthand.
We present a classification of intelligence according to how individual agents learn and propose a definition and a test for HLAI.
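The update described there is the standard Q-learning rule applied to transitions recovered from a verbal account rather than from firsthand experience; a minimal tabular sketch (the parsed transition and action set are hypothetical) looks like this:

```python
from collections import defaultdict

ACTIONS = ("stay", "go")  # hypothetical action set

def q_update_from_description(Q, transitions, alpha=0.1, gamma=0.9):
    """Apply the standard Q-learning update to each described transition,
    exactly as if the agent had experienced it firsthand."""
    for state, action, reward, next_state in transitions:
        best_next = max(Q[(next_state, a)] for a in ACTIONS) if next_state else 0.0
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q

# "I stood at the door, went through it, and was rewarded" is assumed to have
# been parsed into one (state, action, reward, next_state) tuple:
Q = q_update_from_description(defaultdict(float), [("door", "go", 1.0, None)])
print(Q[("door", "go")])  # 0.1: a value learned without firsthand experience
```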
arXiv Detail & Related papers (2020-11-18T17:10:02Z) - Human Evaluation of Interpretability: The Case of AI-Generated Music
Knowledge [19.508678969335882]
We focus on evaluating AI-discovered knowledge/rules in the arts and humanities.
We present an experimental procedure to collect and assess human-generated verbal interpretations of AI-generated music theory/rules rendered as sophisticated symbolic/numeric objects.
arXiv Detail & Related papers (2020-04-15T06:03:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.