Benchmarking at the Edge of Comprehension
- URL: http://arxiv.org/abs/2602.14307v2
- Date: Thu, 19 Feb 2026 20:40:13 GMT
- Title: Benchmarking at the Edge of Comprehension
- Authors: Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr
- Abstract summary: If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims.
- Score: 38.43582342860192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.
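The itemized bipartite Bradley-Terry model mentioned in the abstract can be read as a logistic model in which each question item pits a generator model against a solver model, and the solver "wins" when its answer is critique-resilient. The sketch below is a hypothetical illustration of such a joint fit, not the authors' implementation: the function name fit_bipartite_bt, the plain gradient-ascent loop, the L2 regularizer, and the toy records are all assumptions made for the example.

```python
# Minimal sketch of a bipartite Bradley-Terry fit (illustrative, not the paper's code).
# Each record pairs a "solver" model with a "generator" model on one question item;
# outcome = 1 if the solver's answer survived adversarial critique, else 0.
import numpy as np

def fit_bipartite_bt(records, models, lr=0.1, steps=2000, l2=1e-3):
    """Fit a solving skill and a question-generation skill per model.

    Assumed model: P(solver answers correctly) = sigmoid(theta_solver - phi_generator).
    Returns two dicts mapping model name -> skill, each zero-centered.
    """
    idx = {m: i for i, m in enumerate(models)}
    s = np.array([idx[r[0]] for r in records])     # solver index per record
    g = np.array([idx[r[1]] for r in records])     # generator index per record
    y = np.array([float(r[2]) for r in records])   # 1 = critique-resilient correct
    theta = np.zeros(len(models))                   # solving ability
    phi = np.zeros(len(models))                     # ability to pose hard questions

    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(theta[s] - phi[g])))  # predicted success probability
        err = y - p                                      # log-likelihood gradient signal
        grad_theta = np.bincount(s, weights=err, minlength=len(models)) - l2 * theta
        grad_phi = np.bincount(g, weights=-err, minlength=len(models)) - l2 * phi
        theta += lr * grad_theta / len(records)
        phi += lr * grad_phi / len(records)

    theta -= theta.mean()  # remove the free offset so skills are comparable
    phi -= phi.mean()
    return ({m: theta[i] for m, i in idx.items()},
            {m: phi[i] for m, i in idx.items()})

# Toy usage: (solver, generator, correct) triples from the adversarial game.
models = ["A", "B", "C"]
records = [("A", "B", 1), ("A", "C", 1), ("B", "A", 0),
           ("B", "C", 1), ("C", "A", 0), ("C", "B", 0)]
solve_skill, gen_skill = fit_bipartite_bt(records, models)
print(sorted(solve_skill, key=solve_skill.get, reverse=True))  # ranking by solving skill
```

In this toy reading, ranking models by solve_skill gives the solving leaderboard, while gen_skill orders them by how difficult (yet answerable) their generated questions are, corresponding to the two rankings the abstract describes.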
Related papers
- RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty [102.02839046225468]
RankLLM is a novel framework designed to quantify both question difficulty and model competency.
We evaluate 30 models on 35,550 questions across multiple domains.
arXiv Detail & Related papers (2026-02-12T21:28:46Z)
- What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation [67.47463575774388]
We decompose reasoning quality into two dimensions: relevance and coherence.
To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE).
We show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance.
arXiv Detail & Related papers (2025-10-23T14:30:37Z)
- CORE: Full-Path Evaluation of LLM Agents Beyond Final State [2.0391237204597368]
Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state.
We propose a framework based on deterministic finite automata that encodes tasks as sets of valid tool-use paths.
We introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall's tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency.
arXiv Detail & Related papers (2025-09-25T10:49:35Z)
- The illusion of a perfect metric: Why evaluating AI's words is harder than it looks [0.0]
Natural Language Generation (NLG) is crucial for the practical adoption of AI.
Human evaluation is considered the de facto standard, but it is expensive and lacks scalability.
No single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications.
arXiv Detail & Related papers (2025-08-19T13:22:41Z)
- Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks [2.3188831772813105]
We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates.
We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions.
arXiv Detail & Related papers (2025-07-23T17:58:14Z)
- Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation.
We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety.
We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z)
- More than Marketing? On the Information Value of AI Benchmarks for Practitioners [42.73526862595375]
In academia, public benchmarks were generally viewed as suitable measures for capturing research progress.
In product and policy, benchmarks were often found to be inadequate for informing substantive decisions.
We conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals.
arXiv Detail & Related papers (2024-12-07T03:35:39Z)
- A Comparative Analysis on Ethical Benchmarking in Large Language Models [0.0]
This work contributes to the field of Machine Ethics (ME) benchmarking, which develops tests to assess whether intelligent systems accurately represent human values and act accordingly.
We identify three major issues with current ME benchmarks: limited ecological validity due to unrealistic ethical dilemmas, unstructured question generation without clear inclusion/exclusion criteria, and a lack of scalability due to reliance on human annotations.
We introduce two new ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both featuring real-world ethical dilemmas from the medical domain.
arXiv Detail & Related papers (2024-10-11T05:05:21Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.