Benchmarking is Broken -- Don't Let AI be its Own Judge
- URL: http://arxiv.org/abs/2510.07575v2
- Date: Wed, 15 Oct 2025 05:58:21 GMT
- Title: Benchmarking is Broken -- Don't Let AI be its Own Judge
- Authors: Zerui Cheng, Stella Wohnig, Ruchika Gupta, Samiul Alam, Tassallah Abdullahi, João Alves Ribeiro, Christian Nielsen-Garcia, Saif Mir, Siran Li, Jason Orender, Seyed Ali Bahrainian, Daniel Kirste, Aaron Gokaslan, Mikołaj Glinka, Carsten Eickhoff, Ruben Wolff,
- Abstract summary: We argue that the current laissez-faire approach to evaluating AI is unsustainable.<n>We introduce PeerBench, a community-governed, proctored evaluation blueprint.<n>Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.
- Score: 22.93026946593552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The meteoric rise of AI, with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody's. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today's AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench (with its prototype implementation at https://www.peerbench.ai/), a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.
Related papers
- Making LLMs Reliable When It Matters Most: A Five-Layer Architecture for High-Stakes Decisions [51.56484100374058]
Current large language models (LLMs) excel in verifiable domains where outputs can be checked before action but prove less reliable for high-stakes strategic decisions with uncertain outcomes.<n>This gap, driven by mutually cognitive biases in both humans and artificial intelligence (AI) systems, threatens the defensibility of valuations and sustainability of investments in the sector.<n>This report describes a framework emerging from systematic qualitative assessment across 7 frontier-grade LLMs and 3 market-facing venture vignettes under time pressure.
arXiv Detail & Related papers (2025-11-10T22:24:21Z) - Zero-shot reasoning for simulating scholarly peer-review [0.0]
We investigate a deterministic simulation framework that provides the first stable, evidence-based standard for evaluating AI-generated peer review reports.<n>First, the system is able to simulate calibrated editorial judgment, with 'Revise' decisions consistently forming the majority outcome.<n>Second, it maintains unwavering procedural integrity, enforcing a stable 29% evidence-anchoring compliance rate.
arXiv Detail & Related papers (2025-10-02T13:59:14Z) - AI and the Future of Academic Peer Review [0.1622854284766506]
Large language models (LLMs) are being piloted across the peer-review pipeline by journals, funders, and individual reviewers.<n>Early studies suggest that AI assistance can produce reviews comparable in quality to humans.<n>We show that supervised LLM assistance can improve error detection, timeliness, and reviewer workload without displacing human judgment.
arXiv Detail & Related papers (2025-09-17T17:27:12Z) - Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance [211.5823259429128]
We propose a comprehensive framework integrating technical and societal dimensions, structured around three interconnected pillars: Intrinsic Security, Derivative Security, and Social Ethics.<n>We identify three core challenges: (1) the generalization gap, where defenses fail against evolving threats; (2) inadequate evaluation protocols that overlook real-world risks; and (3) fragmented regulations leading to inconsistent oversight.<n>Our framework offers actionable guidance for researchers, engineers, and policymakers to develop AI systems that are not only robust and secure but also ethically aligned and publicly trustworthy.
arXiv Detail & Related papers (2025-08-12T09:42:56Z) - When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs [8.575522204707958]
Large language models (LLMs) grow in capability and autonomy, evaluating their outputs-especially in open-ended and complex tasks-has become a critical bottleneck.<n>A new paradigm is emerging: using AI agents as the evaluators themselves.<n>In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings.
arXiv Detail & Related papers (2025-08-05T01:42:25Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.<n>We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - Anchoring AI Capabilities in Market Valuations: The Capability Realization Rate Model and Valuation Misalignment Risk [2.1142253753427402]
Recent breakthroughs in artificial intelligence have triggered surges in market valuations for AI-related companies.<n>We propose a Capability Realization Rate model to quantify the gap between AI potential and realized performance.<n>We conclude with policy recommendations to improve transparency, mitigate speculative bubbles, and align AI innovation with sustainable market value.
arXiv Detail & Related papers (2025-05-15T01:06:06Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.<n>The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives [52.863024096759816]
Misaligned research objectives have hindered progress in adversarial robustness research over the past decade.<n>We argue that realigned objectives are necessary for meaningful progress in adversarial alignment.
arXiv Detail & Related papers (2025-02-17T15:28:40Z) - Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, ability to explain the decisions, address the bias in their training data, are some of the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.