Human or Machine? Turing Tests for Vision and Language
- URL: http://arxiv.org/abs/2211.13087v1
- Date: Wed, 23 Nov 2022 16:16:52 GMT
- Title: Human or Machine? Turing Tests for Vision and Language
- Authors: Mengmi Zhang, Giorgia Dellaferrera, Ankur Sikarwar, Marcelo
Armendariz, Noga Mudrik, Prachi Agrawal, Spandan Madan, Andrei Barbu, Haochen
Yang, Tanishq Kumar, Meghna Sadwani, Stella Dellaferrera, Michele Pizzochero,
Hanspeter Pfister, Gabriel Kreiman
- Abstract summary: We systematically benchmark current AIs in their abilities to imitate humans.
Experiments involved testing 769 human agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges.
Results reveal that current AIs are not far from being able to impersonate human judges across different genders, ages, and educational levels.
- Score: 22.110556671410624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI algorithms increasingly participate in daily activities that used to be
the sole province of humans, we are inevitably called upon to consider how much
machines are really like us. To address this question, we turn to the Turing
test and systematically benchmark current AIs in their abilities to imitate
humans. We establish a methodology to evaluate humans versus machines in
Turing-like tests and systematically evaluate a representative set of selected
domains, parameters, and variables. The experiments involved testing 769 human
agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges, in
21,570 Turing tests across 6 tasks encompassing vision and language modalities.
Surprisingly, the results reveal that current AIs are not far from being able
to impersonate human judges across different ages, genders, and educational
levels in complex visual and language challenges. In contrast, simple AI judges
outperform human judges in distinguishing human answers versus machine answers.
The curated large-scale Turing test datasets introduced here and their
evaluation metrics provide valuable insights to assess whether an agent is
human or not. The proposed formulation to benchmark human imitation ability in
current AIs paves the way for the research community to expand Turing tests to
other research areas and conditions. All source code and data are publicly
available at https://tinyurl.com/8x8nha7p
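The abstract above describes human and AI judges deciding, answer by answer, whether each response came from a human or a machine, and reports results broken down by judge demographics. Below is a minimal sketch of one such evaluation metric: the fraction of an agent's answers that judges label "human", grouped by judge type and demographic group. The field names (judge_type, judge_group, agent_type, verdict) and the trial records are hypothetical illustrations, not the authors' released evaluation code or data format.

```python
# Minimal sketch (hypothetical field names, not the authors' released code):
# estimate how often an agent's answers are judged "human" in Turing-style tests.
from collections import defaultdict

def imitation_rate(trials, agent_type="ai"):
    """Fraction of trials in which `agent_type` answers were judged human,
    keyed by (judge_type, judge_group)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [judged_human, total]
    for t in trials:
        if t["agent_type"] != agent_type:
            continue
        key = (t["judge_type"], t.get("judge_group", "all"))
        counts[key][1] += 1
        if t["verdict"] == "human":
            counts[key][0] += 1
    return {k: judged / total for k, (judged, total) in counts.items() if total}

# Example usage with made-up trial records:
trials = [
    {"judge_type": "human", "judge_group": "18-25", "agent_type": "ai", "verdict": "human"},
    {"judge_type": "human", "judge_group": "18-25", "agent_type": "ai", "verdict": "machine"},
    {"judge_type": "ai", "judge_group": "all", "agent_type": "ai", "verdict": "machine"},
]
print(imitation_rate(trials))  # {('human', '18-25'): 0.5, ('ai', 'all'): 0.0}
```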
Related papers
- Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead [2.809966405091883]
We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems.
arXiv Detail & Related papers (2025-07-30T18:14:35Z) - On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities.
We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z) - Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content.
We systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset.
Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z) - AI-Driven Agents with Prompts Designed for High Agreeableness Increase the Likelihood of Being Mistaken for a Human in the Turing Test [0.0]
GPT agents with varying levels of agreeableness were tested in a Turing Test.
All exceeded a 50% confusion rate, with the highly agreeable AI agent surpassing 60%.
This agent was also recognized as exhibiting the most human-like traits.
arXiv Detail & Related papers (2024-11-20T23:12:49Z) - Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA [43.116608441891096]
Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning.
State-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval.
arXiv Detail & Related papers (2024-10-09T03:53:26Z) - Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI-generated versus human-generated content.
We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z) - Rolling in the deep of cognitive and AI biases [1.556153237434314]
We argue that there is an urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed and deployed.
We address this critical issue by following a radical new methodology under which human cognitive biases become core entities in our AI fairness overview.
We introduce a new mapping from human cognitive biases to AI biases, and we detect relevant fairness intensities and inter-dependencies.
arXiv Detail & Related papers (2024-07-30T21:34:04Z) - Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z) - The Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies [97.5153823429076]
The benefits, challenges and drawbacks of AI in this field are reviewed.
The use of data augmentation, explainable AI, and the integration of AI with traditional experimental methods are also discussed.
arXiv Detail & Related papers (2022-12-08T23:23:39Z) - Human Heuristics for AI-Generated Language Are Flawed [8.465228064780744]
We study how humans discern whether verbal self-presentations, one of the most personal and consequential forms of language, were generated by AI.
We experimentally demonstrate that these wordings make human judgment of AI-generated language predictable and manipulable.
We discuss solutions, such as AI accents, to reduce the deceptive potential of language generated by AI.
arXiv Detail & Related papers (2022-06-15T03:18:56Z) - Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z) - Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being.
For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z) - Human Evaluation of Interpretability: The Case of AI-Generated Music Knowledge [19.508678969335882]
We focus on evaluating AI-discovered knowledge/rules in the arts and humanities.
We present an experimental procedure to collect and assess human-generated verbal interpretations of AI-generated music theory/rules rendered as sophisticated symbolic/numeric objects.
arXiv Detail & Related papers (2020-04-15T06:03:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.