AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
- URL: http://arxiv.org/abs/2504.00938v1
- Date: Tue, 01 Apr 2025 16:20:29 GMT
- Title: AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models
- Authors: Kristen M. Edwards, Farnaz Tehranchi, Scarlett R. Miller, Faez Ahmed
- Abstract summary: This paper introduces a rigorous statistical framework to determine whether an AI judge's ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics. Results show that the top-performing AI judge achieves expert-level agreement for uniqueness and drawing quality.
- Score: 3.092385483349516
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The subjective evaluation of early stage engineering designs, such as conceptual sketches, traditionally relies on human experts. However, expert evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI "judges" perform on par with human experts. However, no existing framework assesses expert equivalence. This paper introduces a rigorous statistical framework to determine whether an AI judge's ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including uni- vs. multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert-equivalence. Results show that the top-performing AI judge, using text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. In 6/6 runs for both uniqueness and creativity, and 5/6 runs for both drawing quality and usefulness, its agreement with experts meets or exceeds that of the majority of trained novices. These findings suggest that reasoning-supported VLM models can achieve human-expert equivalence in design evaluation. This has implications for scaling design evaluation in education and practice, and provides a general statistical framework for validating AI judges in other domains requiring subjective content evaluation.
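The abstract does not spell out the statistical procedure, so the snippet below is only a hedged sketch of how "expert equivalence" is often operationalized: compare the AI judge's agreement with a human expert against the agreement between two human experts and check non-inferiority. The hypothetical ratings, the choice of quadratic-weighted Cohen's kappa, the bootstrap, and the 0.1 margin are all illustrative assumptions, not the authors' framework.

```python
# Illustrative sketch only: NOT the paper's exact statistical framework.
# It checks whether an AI judge's agreement with a human expert is
# non-inferior to the agreement between two human experts, using
# quadratic-weighted Cohen's kappa and a bootstrap confidence interval.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical 1-5 ratings of 100 design sketches by two experts and an AI judge.
expert_a = rng.integers(1, 6, size=100)
expert_b = np.clip(expert_a + rng.integers(-1, 2, size=100), 1, 5)
ai_judge = np.clip(expert_a + rng.integers(-1, 2, size=100), 1, 5)

def qwk(x, y):
    """Quadratic-weighted Cohen's kappa, a standard agreement metric for ordinal ratings."""
    return cohen_kappa_score(x, y, weights="quadratic")

# Bootstrap the difference: kappa(AI, expert A) - kappa(expert B, expert A).
n = len(expert_a)
diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    diffs.append(qwk(ai_judge[idx], expert_a[idx]) - qwk(expert_b[idx], expert_a[idx]))
lower, upper = np.percentile(diffs, [2.5, 97.5])

margin = 0.1  # illustrative non-inferiority margin on the kappa scale
print(f"kappa(AI, expert A)       = {qwk(ai_judge, expert_a):.3f}")
print(f"kappa(expert B, expert A) = {qwk(expert_b, expert_a):.3f}")
print(f"95% bootstrap CI for the difference: [{lower:.3f}, {upper:.3f}]")
print("expert-equivalent (non-inferior)" if lower > -margin else "equivalence not shown")
```

Other agreement statistics (e.g., Krippendorff's alpha) or a formal equivalence test such as TOST could be substituted without changing the overall structure of the check.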
Related papers
- PhyCritic: Multimodal Critic Models for Physical AI [101.37916322714041]
We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline. We show that PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
arXiv Detail & Related papers (2026-02-11T18:35:39Z) - Empowering Affected Individuals to Shape AI Fairness Assessments: Processes, Criteria, and Tools [5.72357951997548]
Existing fairness assessments are typically conducted by AI experts or regulators using protected attributes and metrics. Recent work has called for involving affected individuals in fairness assessment, yet little empirical evidence exists on how they create their own fairness criteria.
arXiv Detail & Related papers (2026-01-27T12:51:01Z) - Practitioner Insights on Fairness Requirements in the AI Development Life Cycle: An Interview Study [3.5429774642987915]
We conducted research on fairness requirements in AI from a software engineering perspective. Our study assesses the participants' awareness of fairness in AI/ML software and its application within the Software Development Life Cycle (SDLC). Findings show that while our participants recognize the aforementioned AI fairness dimensions, practices are inconsistent, and fairness is often deprioritized with noticeable knowledge gaps.
arXiv Detail & Related papers (2025-12-15T19:12:34Z) - An Approach to Grounding AI Model Evaluations in Human-derived Criteria [0.0]
We propose a novel approach to augment existing benchmarks with human-derived evaluation criteria. Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys. Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance.
arXiv Detail & Related papers (2025-09-04T21:40:32Z) - Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z) - Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education [3.557803321422781]
This article presents a scalable, AI-supported framework for qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments. We report on its successful deployment across a large college of engineering.
arXiv Detail & Related papers (2025-08-01T20:27:40Z) - Quantitative LLM Judges [48.676042957523045]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain. The models are trained to improve the score of the original judge by using the judge's textual evaluation and score. Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z) - Measurement to Meaning: A Validity-Centered Framework for AI Evaluation [12.55408229639344]
We provide a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence. Our framework is well-suited for the contemporary paradigm in machine learning.
arXiv Detail & Related papers (2025-05-13T20:36:22Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Fairness Evaluation with Item Response Theory [10.871079276188649]
This paper proposes a novel Fair-IRT framework to evaluate fairness in Machine Learning (ML) models.
Detailed explanations for item characteristic curves (ICCs) are provided for particular individuals.
Experiments demonstrate the effectiveness of this framework as a fairness evaluation tool.
arXiv Detail & Related papers (2024-10-20T22:25:20Z) - How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence [0.9671462473115854]
Generative AI such as those with large language models have created opportunities for innovative assessment design practices.
This paper presents a framework that explores the capabilities of the LLM ChatGPT4 application, which is the current industry benchmark.
This critique will provide specific and targeted indications of the questions' vulnerabilities in terms of critical thinking skills.
arXiv Detail & Related papers (2024-06-20T22:46:56Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
arXiv Detail & Related papers (2023-10-25T05:38:38Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation model in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA).
We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli.
Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)