On Meta-Evaluation
- URL: http://arxiv.org/abs/2601.14262v1
- Date: Thu, 27 Nov 2025 06:31:16 GMT
- Title: On Meta-Evaluation
- Authors: Hongxiao Li, Chenxi Wang, Fanda Fan, Zihan Wang, Wanling Gao, Lei Wang, Jianfeng Zhan
- Abstract summary: Methods such as observational studies, design of experiments (DoE), and randomized controlled trials (RCTs) have shaped modern scientific practice. We introduce a formal framework for meta-evaluation by defining the evaluation space, its structured representation, and a benchmark we call AxiaBench. AxiaBench enables the first large-scale, quantitative comparison of ten widely used evaluation methods across eight representative application domains.
- Score: 11.44925492599594
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Evaluation is the foundation of empirical science, yet the evaluation of evaluation itself -- so-called meta-evaluation -- remains strikingly underdeveloped. While methods such as observational studies, design of experiments (DoE), and randomized controlled trials (RCTs) have shaped modern scientific practice, there has been little systematic inquiry into their comparative validity and utility across domains. Here we introduce a formal framework for meta-evaluation by defining the evaluation space, its structured representation, and a benchmark we call AxiaBench. AxiaBench enables the first large-scale, quantitative comparison of ten widely used evaluation methods across eight representative application domains. Our analysis reveals a fundamental limitation: no existing method simultaneously achieves accuracy and efficiency across diverse scenarios, with DoE and observational designs in particular showing significant deviations from real-world ground truth. We further evaluate a unified entire-space stratified sampling method from previous evaluatology research and find that it consistently outperforms prior approaches across all tested domains. These results establish meta-evaluation as a scientific object in its own right and provide both a conceptual foundation and a pragmatic tool set for advancing trustworthy evaluation in computational and experimental research.
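The unified method named in the abstract is entire-space stratified sampling. As a rough, hedged sketch of the general idea (partition the whole evaluation space into strata, spend a fixed evaluation budget across all of them, and recombine weighted per-stratum estimates), the Python below is an editorial illustration only; the function names, the proportional budget allocation, and the toy data are assumptions, not the paper's implementation or AxiaBench's API.

```python
import random
from collections import defaultdict

def stratified_estimate(conditions, stratum_of, score, budget, seed=0):
    """Estimate a mean score over an evaluation space with entire-space
    stratified sampling: every stratum of the space gets sampled, and the
    per-stratum means are recombined with weights equal to stratum size."""
    rng = random.Random(seed)

    # Partition the entire evaluation space into strata.
    strata = defaultdict(list)
    for cond in conditions:
        strata[stratum_of(cond)].append(cond)

    total = len(conditions)
    estimate = 0.0
    for members in strata.values():
        # Spend the budget proportionally to stratum size (assumed rule).
        n = max(1, round(budget * len(members) / total))
        sample = rng.sample(members, min(n, len(members)))
        stratum_mean = sum(score(c) for c in sample) / len(sample)
        # Weight each stratum's mean by its share of the whole space.
        estimate += stratum_mean * len(members) / total
    return estimate

# Hypothetical usage: estimate accuracy over a pool of test conditions
# stratified by difficulty, evaluating only 30 of 300 conditions.
rng = random.Random(1)
pool = [{"difficulty": d, "correct": rng.random() < 0.7}
        for d in ("easy", "medium", "hard") for _ in range(100)]
print(stratified_estimate(pool,
                          stratum_of=lambda c: c["difficulty"],
                          score=lambda c: float(c["correct"]),
                          budget=30))
```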
Related papers
- The Validity of Coreference-based Evaluations of Natural Language Understanding [3.505146496638911]
I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions. I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events.
arXiv Detail & Related papers (2026-02-18T05:49:28Z)
- InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem [87.30601926271864]
InnoEval is a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval.
arXiv Detail & Related papers (2026-02-16T00:40:31Z)
- The Benchmarking Epistemology: Construct Validity for Evaluating Machine Learning Models [1.1315617886931963]
We develop conditions of construct validity inspired by psychological measurement theory. We examine these assumptions in practice through three case studies. Our framework clarifies conditions under which benchmark scores can support diverse scientific claims.
arXiv Detail & Related papers (2025-10-27T10:30:30Z)
- PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. Results show that our evaluation framework is aligned with human experts' scoring.
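As a rough editorial illustration of representing a solution as a DAG of formula steps and granting process-level credit only to steps whose prerequisites are also present, the sketch below uses an invented example and scoring rule; it is an assumption for clarity, not PRISM-Physics' actual protocol.

```python
def process_score(reference_dag, candidate_steps):
    """reference_dag: step id -> set of prerequisite step ids (a DAG).
    candidate_steps: set of step ids found in the candidate solution.
    A step is credited only if it and its entire ancestry are present."""
    def satisfied(step, seen=()):
        if step in seen:  # guard against accidental cycles
            return False
        return step in candidate_steps and all(
            satisfied(p, seen + (step,)) for p in reference_dag.get(step, set()))

    credited = [s for s in reference_dag if satisfied(s)]
    return len(credited) / len(reference_dag)

# Hypothetical four-step physics solution.
reference = {
    "kinematics": set(),
    "force_balance": set(),
    "work_energy": {"kinematics", "force_balance"},
    "final_answer": {"work_energy"},
}
# The candidate skipped 'force_balance', so downstream steps earn no credit.
print(process_score(reference, {"kinematics", "work_energy", "final_answer"}))  # 0.25
```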
arXiv Detail & Related papers (2025-10-03T17:09:03Z)
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [136.27567671480156]
We introduce experiment-guided ranking, which prioritizes hypotheses based on feedback from prior tests. We frame experiment-guided ranking as a sequential decision-making problem. Our approach significantly outperforms pre-experiment baselines and strong ablations.
arXiv Detail & Related papers (2025-05-23T13:24:50Z)
- Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation across a wide range of scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
- Make Full Use of Testing Information: An Integrated Accelerated Testing and Evaluation Method for Autonomous Driving Systems [6.065650382599096]
To make full use of testing information, this paper proposes an Integrated accelerated Testing and Evaluation Method (ITEM) for the testing and evaluation of autonomous driving systems (ADSs). The experimental results show that ITEM identifies hazardous domains well in both low- and high-dimensional cases, regardless of the shape of those domains.
arXiv Detail & Related papers (2025-01-21T06:59:25Z)
- Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z)
- Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z)
- Evaluatology: The Science and Engineering of Evaluation [11.997673313601423]
This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation.
We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines.
arXiv Detail & Related papers (2024-03-19T13:38:26Z)
- Evaluation in Neural Style Transfer: A Review [0.7614628596146599]
We provide an in-depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices.
We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons but will also enhance the comprehension and interpretation of research findings in the field.
arXiv Detail & Related papers (2024-01-30T15:45:30Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.