Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
- URL: http://arxiv.org/abs/2507.23009v1
- Date: Wed, 30 Jul 2025 18:14:35 GMT
- Title: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
- Authors: Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, Samira Samadi
- Abstract summary: We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems.
- Score: 2.809966405091883
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias, and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.
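The abstract's suggestion to build on frameworks for constructing and validating psychometric tests can be made concrete with a small illustration. The sketch below is not from the paper; the data are simulated and all names and numbers are hypothetical. It computes two classical item-analysis statistics, Cronbach's alpha and corrected item-total correlations, over a matrix of per-item model scores: the kind of reliability check an AI-specific benchmark might undergo before its total score is treated as a measurement of a trait.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal-consistency reliability for a (respondents x items) 0/1 score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item across models
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total (summed) scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def item_total_correlations(scores: np.ndarray) -> np.ndarray:
    """Corrected item-total correlation: each item vs. the sum of the remaining items."""
    totals = scores.sum(axis=1, keepdims=True)
    rest = totals - scores                       # total score excluding the item itself
    return np.array([np.corrcoef(scores[:, j], rest[:, j])[0, 1]
                     for j in range(scores.shape[1])])

# Hypothetical data: 0/1 outcomes of 50 models on 12 benchmark items,
# simulated from a single latent "ability" so the check has something to find.
rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))
difficulty = np.linspace(-1.5, 1.5, 12)
p = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
scores = (rng.random((50, 12)) < p).astype(float)

print("Cronbach's alpha:", round(cronbach_alpha(scores), 3))
print("item-total correlations:", np.round(item_total_correlations(scores), 2))
```

Low alpha or weakly correlated items would suggest that the benchmark's items do not measure one coherent trait, which is the kind of validity evidence the abstract argues is currently missing from trait-style interpretations of benchmark scores.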
Related papers
- Perceptual Quality Assessment for Embodied AI [66.96928199019129]
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories. There is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots.
arXiv Detail & Related papers (2025-05-22T15:51:07Z)
- On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z)
- Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments [0.0]
Despite the widespread use of multiple-choice questions in assessments, the detection of AI cheating has been almost unexplored. We propose a method based on the application of Item Response Theory (IRT) to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns (an illustrative sketch of this idea appears after this list of related papers).
arXiv Detail & Related papers (2024-11-28T09:43:06Z)
- Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI-generated versus human-generated content.
We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z)
- ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluency with truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as those from Likert scales.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- MAILS -- Meta AI Literacy Scale: Development and Testing of an AI Literacy Questionnaire Based on Well-Founded Competency Models and Psychological Change- and Meta-Competencies [6.368014180870025]
The questionnaire should be modular (i.e., including different facets that can be used independently of each other) to be flexibly applicable in professional life.
We derived 60 items to represent different facets of AI Literacy according to Ng and colleagues' conceptualisation of AI literacy. An additional 12 items represent psychological competencies such as problem solving, learning, and emotion regulation with regard to AI.
arXiv Detail & Related papers (2023-02-18T12:35:55Z)
- Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap [45.6806234490428]
We benchmark current AIs in their abilities to imitate humans in three language tasks and three vision tasks.
Experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges.
Results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges.
arXiv Detail & Related papers (2022-11-23T16:16:52Z)
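The IRT-based entry above proposes distinguishing human from AI respondents by their response patterns. Below is a minimal illustrative sketch of that general idea, not the paper's actual method: a two-parameter logistic (2PL) item bank is assumed to have been calibrated on human test-takers, an ability estimate is obtained for a new response pattern, and a crude person-fit score (mean log-likelihood at the estimate) flags patterns that are unlikely under the human-calibrated model. All item parameters and response vectors here are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical 2PL item bank, assumed to be calibrated on human test-takers.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9, 1.3])    # discriminations
b = np.array([-1.0, -0.3, 0.0, 0.4, 1.1, 1.8])  # difficulties (easy -> hard)

def p_correct(theta: float) -> np.ndarray:
    """2PL probability of answering each item correctly at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_log_likelihood(theta: float, x: np.ndarray) -> float:
    p = p_correct(theta)
    return -float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def ability_and_fit(x: np.ndarray) -> tuple[float, float]:
    """MLE of ability plus a simple person-fit score: mean log-likelihood at the MLE."""
    res = minimize_scalar(neg_log_likelihood, args=(x,),
                          bounds=(-4.0, 4.0), method="bounded")
    return float(res.x), -res.fun / len(x)

# Two hypothetical response patterns with the same raw score (4 of 6 correct).
human_like = np.array([1, 1, 1, 1, 0, 0])  # misses only the two hardest items
aberrant   = np.array([0, 0, 1, 1, 1, 1])  # misses the two easiest, solves the hardest

for name, x in [("human-like", human_like), ("aberrant", aberrant)]:
    theta_hat, fit = ability_and_fit(x)
    print(f"{name}: theta_hat = {theta_hat:+.2f}, mean log-likelihood = {fit:.2f}")
```

The two patterns have the same raw score, but the "aberrant" respondent solves the hardest items while missing the easiest ones, a pattern the human-calibrated model considers improbable at any ability level, so its fit score comes out markedly lower.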