Preliminary suggestions for rigorous GPAI model evaluations
- URL: http://arxiv.org/abs/2508.00875v1
- Date: Tue, 22 Jul 2025 03:27:42 GMT
- Title: Preliminary suggestions for rigorous GPAI model evaluations
- Authors: Patricia Paskov, Michael J. Byun, Kevin Wei, Toby Webster,
- Abstract summary: This document presents a preliminary compilation of general-purpose AI (GPAI) evaluation practices. It includes suggestions for human uplift studies and benchmark evaluations. Suggestions are organised across four stages in the evaluation life cycle: design, implementation, execution and documentation.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This document presents a preliminary compilation of general-purpose AI (GPAI) evaluation practices that may promote internal validity, external validity and reproducibility. It includes suggestions for human uplift studies and benchmark evaluations, as well as cross-cutting suggestions that may apply to many different evaluation types. Suggestions are organised across four stages in the evaluation life cycle: design, implementation, execution and documentation. Drawing from established practices in machine learning, statistics, psychology, economics, biology and other fields recognised to have important lessons for AI evaluation, these suggestions seek to contribute to the conversation on the nascent and evolving field of the science of GPAI evaluations. The intended audience of this document includes providers of GPAI models presenting systemic risk (GPAISR), for whom the EU AI Act lays out specific evaluation requirements; third-party evaluators; policymakers assessing the rigour of evaluations; and academic researchers developing or conducting GPAI evaluations.
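To make the abstract's organisation concrete, below is a minimal Python sketch of how such suggestions could be indexed by life-cycle stage and evaluation type. The class names, fields, and the example entry are illustrative assumptions and are not drawn from the paper itself.

```python
# Hypothetical illustration only: one way to organise evaluation suggestions by
# the four life-cycle stages named in the abstract (design, implementation,
# execution, documentation) and the evaluation types it covers (human uplift
# studies, benchmark evaluations, cross-cutting). All names and the example
# entry are assumptions, not content taken from the paper.
from dataclasses import dataclass, field

STAGES = ("design", "implementation", "execution", "documentation")
EVAL_TYPES = ("human_uplift", "benchmark", "cross_cutting")


@dataclass
class Suggestion:
    stage: str       # one of STAGES
    eval_type: str   # one of EVAL_TYPES
    text: str        # the practice being suggested


@dataclass
class EvaluationChecklist:
    suggestions: list[Suggestion] = field(default_factory=list)

    def add(self, stage: str, eval_type: str, text: str) -> None:
        # Reject entries outside the stage/type taxonomy described above.
        assert stage in STAGES and eval_type in EVAL_TYPES
        self.suggestions.append(Suggestion(stage, eval_type, text))

    def for_stage(self, stage: str) -> list[Suggestion]:
        return [s for s in self.suggestions if s.stage == stage]


# Example usage with a made-up entry:
checklist = EvaluationChecklist()
checklist.add("design", "benchmark",
              "Pre-register the metric and sampling plan before running the model.")
print([s.text for s in checklist.for_stage("design")])
```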
Related papers
- InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem [87.30601926271864]
InnoEval is a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval.
arXiv Detail & Related papers (2026-02-16T00:40:31Z) - Beyond the Binary: The System of All-round Evaluation of Research and Its Practices in China [3.6998581528902625]
This paper introduces the System of All-round Evaluation of Research (SAER), a framework that integrates form, content, and utility evaluations with six key elements. The comprehensive system proposes a trinity of three evaluation dimensions, combined with six evaluation elements, to help academic evaluators and researchers reconcile binary oppositions in evaluation methods.
arXiv Detail & Related papers (2025-09-10T12:52:08Z) - AI Testing Should Account for Sophisticated Strategic Behaviour [19.554240127749818]
This position paper argues for two claims regarding AI testing and evaluation. First, evaluations need to account for the possibility that AI systems understand their circumstances and reason strategically. Second, game-theoretic analysis can inform evaluation design by formalising and scrutinising the reasoning in evaluation-based safety cases.
arXiv Detail & Related papers (2025-08-19T15:48:25Z) - SPHERE: An Evaluation Card for Human-AI Systems [75.0887588648484]
We present SPHERE, an evaluation card that encompasses five key dimensions. We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement.
arXiv Detail & Related papers (2025-03-24T20:17:20Z) - ReviewEval: An Evaluation Framework for AI-Generated Reviews [9.35023998408983]
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. We propose ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and identifies the degree of constructiveness and adherence to reviewer guidelines. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
arXiv Detail & Related papers (2025-02-17T12:22:11Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Standing on FURM ground -- A framework for evaluating Fair, Useful, and Reliable AI Models in healthcare systems [6.305990032645096]
Stanford Health Care has developed a Testing and Evaluation mechanism to identify fair, useful and reliable AI models.
We describe the assessment process, summarize the six assessments, and share our framework to enable others to conduct similar assessments.
Our novel contributions - usefulness estimates by simulation, financial projections to quantify sustainability, and a process to do ethical assessments - are available for other healthcare systems to conduct actionable evaluations of candidate AI solutions.
arXiv Detail & Related papers (2024-02-27T03:33:40Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [55.33653554387953]
Pattern Analysis and Machine Intelligence (PAMI) has led to numerous literature reviews aimed at collecting and organising fragmented information. This paper presents a thorough analysis of these literature reviews within the PAMI field. We try to address three core research questions: (1) What are the prevalent structural and statistical characteristics of PAMI literature reviews; (2) What strategies can researchers employ to efficiently navigate the growing corpus of reviews; and (3) What are the advantages and limitations of AI-generated reviews compared to human-authored ones?
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - Evaluation in Neural Style Transfer: A Review [0.7614628596146599]
We provide an in-depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices.
We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons but will also enhance the comprehension and interpretation of research findings in the field.
arXiv Detail & Related papers (2024-01-30T15:45:30Z) - Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.