An Approach to Grounding AI Model Evaluations in Human-derived Criteria
- URL: http://arxiv.org/abs/2509.04676v1
- Date: Thu, 04 Sep 2025 21:40:32 GMT
- Title: An Approach to Grounding AI Model Evaluations in Human-derived Criteria
- Authors: Sasha Mitts,
- Abstract summary: We propose a novel approach to augment existing benchmarks with human-derived evaluation criteria. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the rapidly evolving field of artificial intelligence (AI), traditional benchmarks can fall short in attempting to capture the nuanced capabilities of AI models. We focus on the case of physical world modeling and propose a novel approach to augment existing benchmarks with human-derived evaluation criteria, aiming to enhance the interpretability and applicability of model behaviors. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys to identify key cognitive skills, such as Prioritization, Memorizing, Discerning, and Contextualizing, that are critical for both AI and human reasoning. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance. By integrating insights from our findings into benchmark design, we offer a framework for developing more human-aligned means of defining and measuring progress. This work underscores the importance of user-centered evaluation in AI development, providing actionable guidelines for researchers and practitioners aiming to align AI capabilities with human cognitive processes. Our approach both enhances current benchmarking practices and sets the stage for future advancements in AI model evaluation.
Related papers
- Rethinking AI Evaluation in Education: The TEACH-AI Framework and Benchmark for Generative AI Assistants [8.591535882390918]
TEACH-AI is a domain-independent, pedagogically grounded, and stakeholder-aligned framework for guiding the design, development, and evaluation of generative AI systems in education. Our work invites the community to reconsider what constitutes "effective" AI in education and to design model evaluation approaches that promote co-creation, inclusivity, and long-term human, social, and educational impact.
arXiv Detail & Related papers (2025-11-28T17:42:36Z)
- AI as Cognitive Amplifier: Rethinking Human Judgment in the Age of Generative AI [0.65268245109828]
I propose a three-level model of AI engagement. I argue that the transition between levels requires not technical training but the development of domain expertise and metacognitive skills.
arXiv Detail & Related papers (2025-10-30T11:55:34Z)
- Bhatt Conjectures: On Necessary-But-Not-Sufficient Benchmark Tautology for Human Like Reasoning [0.0]
The Bhatt Conjectures framework introduces rigorous, hierarchical benchmarks for evaluating AI reasoning and understanding. The Agentreasoning-sdk demonstrates a practical implementation, revealing that current AI models struggle with complex reasoning tasks.
arXiv Detail & Related papers (2025-06-13T02:41:18Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Methodological Foundations for AI-Driven Survey Question Generation [41.94295877935867]
This paper presents a methodological framework for using generative AI in educational survey research. We explore how Large Language Models can generate adaptive, context-aware survey questions. We examine ethical issues such as bias, privacy, and transparency.
arXiv Detail & Related papers (2025-05-02T09:50:34Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z)
- Developmental Support Approach to AI's Autonomous Growth: Toward the Realization of a Mutually Beneficial Stage Through Experiential Learning [0.0]
This study proposes an "AI Development Support" approach that supports the ethical development of AI itself. We have constructed a learning framework based on a cycle of experience, introspection, analysis, and hypothesis formation.
arXiv Detail & Related papers (2025-02-27T06:12:20Z)
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of the data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- Integration of cognitive tasks into artificial general intelligence test for large models [54.72053150920186]
We advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests.
The cognitive science-inspired AGI tests encompass the full spectrum of intelligence facets, including crystallized intelligence, fluid intelligence, social intelligence, and embodied intelligence.
arXiv Detail & Related papers (2024-02-04T15:50:42Z)
- Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z)
- Evaluating and Improving Value Judgments in AI: A Scenario-Based Study on Large Language Models' Depiction of Social Conventions [5.457150493905063]
We evaluate how contemporary AI services competitively meet user needs, then examine society's depiction as mirrored by Large Language Models.
We suggest a model of decision-making in value-conflicting scenarios which could be adopted for future machine value judgments.
This paper advocates for a practical approach to using AI as a tool for investigating other remote worlds.
arXiv Detail & Related papers (2023-10-04T08:42:02Z)
- An interdisciplinary conceptual study of Artificial Intelligence (AI) for helping benefit-risk assessment practices: Towards a comprehensive qualification matrix of AI programs and devices (pre-print 2020) [55.41644538483948]
This paper proposes a comprehensive analysis of existing concepts from different disciplines that tackle the notion of intelligence.
The aim is to identify shared notions or discrepancies to consider for qualifying AI systems.
arXiv Detail & Related papers (2021-05-07T12:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.