Benchmarks Saturate When The Model Gets Smarter Than The Judge
- URL: http://arxiv.org/abs/2601.19532v1
- Date: Tue, 27 Jan 2026 12:20:44 GMT
- Title: Benchmarks Saturate When The Model Gets Smarter Than The Judge
- Authors: Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis,
- Abstract summary: We present a manually revised version of the Omni-MATH dataset. Each problem was audited to ensure compilability, solvability and verifiability. We compare GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets.
- Score: 4.599673637363014
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset ($n{=}4181$) and a tagged, non-standard subset ($n{=}247$). Each problem was audited to ensure LaTeX compilability, solvability and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in $96.4\%$ of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.
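The judge comparison described in the abstract can be pictured as a simple computation: collect two judges' verdicts on the same problems, find where they disagree, and use expert annotations to see which judge is wrong in those disagreements. The sketch below is a hypothetical illustration with toy data; the function name and the tiny example verdicts are assumptions, not from the paper.

```python
# Hypothetical sketch: quantifying judge-induced noise by comparing two
# automated judges' verdicts and attributing disagreements via expert labels.
# The data below is illustrative toy data, not from the Omni-MATH-2 study.

def judge_noise_report(verdicts_a, verdicts_b, expert):
    """verdicts_a / verdicts_b: lists of bool verdicts (answer judged correct)
    from two judges on the same problems; expert: ground-truth bools."""
    n = len(expert)
    disagreements = [i for i in range(n) if verdicts_a[i] != verdicts_b[i]]
    # Among disagreements, count how often each judge contradicts the expert.
    a_wrong = sum(1 for i in disagreements if verdicts_a[i] != expert[i])
    b_wrong = sum(1 for i in disagreements if verdicts_b[i] != expert[i])
    return {
        "disagreement_rate": len(disagreements) / n,
        "judge_a_wrong_share": a_wrong / len(disagreements) if disagreements else 0.0,
        "judge_b_wrong_share": b_wrong / len(disagreements) if disagreements else 0.0,
    }

# Toy example: 5 problems; the judges disagree on problems 1 and 3,
# and judge A contradicts the expert on both of them.
report = judge_noise_report(
    [True, False, True, True, False],   # judge A's verdicts
    [True, True, True, False, False],   # judge B's verdicts
    [True, True, True, False, False],   # expert annotations
)
```

In the paper's setting, a "judge_a_wrong_share" near 1.0 corresponds to the reported finding that Omni-Judge is wrong in 96.4% of judge disagreements.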
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam [63.84155758655084]
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models. We introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7--10 percentage points.
arXiv Detail & Related papers (2026-02-15T02:50:15Z)
- A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth [4.9467757325435775]
Evaluating large language models (LLMs) on open-ended tasks is increasingly done via the LLM-as-a-judge paradigm. Treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters.
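The judge-aware extension sketched in this summary can be illustrated with a Bradley-Terry-Luce win probability scaled by a judge-specific discrimination parameter: a sharp judge turns a skill gap into a confident preference, while a noisy judge's verdicts approach a coin flip. The parameter names below are assumptions for illustration, not the paper's notation.

```python
import math

# Minimal sketch, assuming a BTL-style model where judge k's preference
# probability is sigmoid(a_k * (theta_i - theta_j)), with a_k the
# judge-specific discrimination parameter described above.

def win_prob(theta_i, theta_j, discrimination):
    """P(model i preferred over model j) under a judge with the given
    discrimination. discrimination -> 0: verdicts become coin flips
    regardless of the skill gap; large values: nearly deterministic."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta_i - theta_j)))

# A sharp judge clearly separates a stronger model from a weaker one...
sharp = win_prob(1.0, 0.0, discrimination=4.0)
# ...while a noisy judge barely does, for the same skill gap.
noisy = win_prob(1.0, 0.0, discrimination=0.2)
```

This is the same intuition as the main paper's finding: as problems get harder, a low-discrimination judge masks genuine differences between models.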
arXiv Detail & Related papers (2026-01-29T15:01:28Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations [0.0]
We introduce the notion of "label convergence" to describe the highest achievable performance under the constraint of contradictory test annotations. We estimate that label convergence lies between 62.63 and 67.52 mAP@[0.5:0.95:0.05] for LVIS with 95% confidence, attributing these bounds to the presence of real annotation errors. With current state-of-the-art (SOTA) models at the upper end of the label convergence interval for the well-studied LVIS dataset, we conclude that model capacity is sufficient to solve current object detection problems.
arXiv Detail & Related papers (2024-09-14T10:59:25Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Systematic analysis of the impact of label noise correction on ML Fairness [0.0]
We develop an empirical methodology to evaluate the effectiveness of label noise correction techniques in ensuring the fairness of models trained on biased datasets.
Our results suggest that the Hybrid Label Noise Correction method achieves the best trade-off between predictive performance and fairness.
arXiv Detail & Related papers (2023-06-28T08:08:14Z)
- Intrinsic Self-Supervision for Data Quality Audits [35.69673085324971]
Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors.
In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem or a scoring problem.
We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases.
arXiv Detail & Related papers (2023-05-26T15:57:04Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare the performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.