Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
- URL: http://arxiv.org/abs/2601.13885v1
- Date: Tue, 20 Jan 2026 11:59:13 GMT
- Title: Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
- Authors: Esma Balkır, Alice Pernthaller, Marco Basaldella, José Hernández-Orallo, Nigel Collier
- Abstract summary: We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge). We introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items as possible. Our method uses 2% of the items while improving ranking correlation by 0.12 over random sampling, with 95% accuracy on confident predictions.
- Score: 25.638175689769934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks whose outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items, and as cheaply, as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 τ over random sampling, with 95% accuracy on confident predictions.
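The page stops at the abstract, so as a rough sketch of the modeling change it describes: standard IRT treats each response as Bernoulli (correct/incorrect), and for bounded continuous scores the paper swaps this for a heteroskedastic normal likelihood. The 2PL-style mean function, the μ(1−μ)-shaped variance, and every name below (`fit_ability`, `s2`, ...) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mu(theta, a, b):
    """2PL-style expected score in (0, 1): sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_log_lik(theta, scores, a, b, s2=0.05):
    """Heteroskedastic normal NLL: variance shrinks near the 0/1 bounds.

    Var = s2 * mu * (1 - mu) is one plausible choice; the paper's exact
    variance function may differ.
    """
    m = mu(theta, a, b)
    var = s2 * m * (1.0 - m) + 1e-9
    return np.sum(0.5 * np.log(2 * np.pi * var) + (scores - m) ** 2 / (2 * var))

def fit_ability(scores, a, b):
    """MLE of one model's ability theta, given item parameters (a, b)."""
    res = minimize_scalar(neg_log_lik, bounds=(-6, 6), method="bounded",
                          args=(np.asarray(scores), np.asarray(a), np.asarray(b)))
    return res.x

# Toy usage: 3 administered items with known discriminations/difficulties.
a = [1.2, 0.8, 1.5]          # item discriminations
b = [-0.5, 0.3, 1.0]         # item difficulties
scores = [0.85, 0.60, 0.30]  # continuous scores (e.g., ROUGE) in [0, 1]
print(fit_ability(scores, a, b))
```

An adaptive loop would then select the next item by an information criterion and stop once the ability estimates are certain enough to rank models confidently; the paper's exact selection and stopping rules are not reproduced here.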
Related papers
- Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals [18.612081365101464]
We develop a framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025 and GSM8K further demonstrate more precise performance estimation and more reliable model rankings.
arXiv Detail & Related papers (2026-02-03T03:40:01Z)
- How to Correctly Report LLM-as-a-Judge Evaluations [13.389479132464778]
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to the imperfect specificity and sensitivity of LLM judges. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both the test and calibration datasets.
arXiv Detail & Related papers (2025-11-26T07:46:46Z)
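For context on the kind of plug-in correction the entry above describes: with binary judge verdicts and judge sensitivity/specificity estimated on calibration data, a Rogan-Gladen-style estimator de-biases the observed pass rate. This is a generic sketch of that idea under those assumptions, not the paper's exact framework; all names are illustrative.

```python
def corrected_pass_rate(observed_rate: float, sensitivity: float,
                        specificity: float) -> float:
    """Rogan-Gladen-style plug-in correction for a noisy binary judge.

    observed_rate: fraction of items the judge marked as correct
    sensitivity:   P(judge says correct | truly correct), from calibration data
    specificity:   P(judge says incorrect | truly incorrect)
    """
    denom = sensitivity + specificity - 1.0  # must be > 0 (judge beats chance)
    p = (observed_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, p))  # clip to the valid probability range

# Example: judge says 70% correct; calibration gives sens=0.9, spec=0.8.
print(corrected_pass_rate(0.70, 0.90, 0.80))  # -> 0.714...
```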
- Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data? [82.09573568241724]
EssenceBench is a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA). Our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. On the HellaSwag benchmark (10K samples), our method preserves the ranking of all models (rank shifts within 5%) using 25x fewer samples, and achieves 95% ranking preservation (shifts within 5%) using 200x fewer samples.
arXiv Detail & Related papers (2025-10-12T05:38:10Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
- Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores [2.886479348067378]
We use benchmarks designed for testing large language models' capacity to reason about cardinal directions. We suggest a simple method for cost-effectively quantifying the uncertainty of a benchmark score.
arXiv Detail & Related papers (2024-10-04T15:04:28Z)
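One generic way to obtain such an uncertainty estimate (not necessarily the method the entry above proposes) is a nonparametric bootstrap over benchmark items. The sketch below is a standard percentile bootstrap, with all names illustrative:

```python
import numpy as np

def bootstrap_score_ci(item_scores, n_boot: int = 2000, alpha: float = 0.05,
                       seed: int = 0):
    """Percentile bootstrap CI for a mean benchmark score.

    Resamples items with replacement; per-item scoring noise is captured
    only to the extent item_scores already averages repeated runs.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(item_scores, dtype=float)
    boots = [rng.choice(scores, size=scores.size, replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# e.g., 0/1 correctness (or continuous scores) for 200 benchmark items
rng = np.random.default_rng(1)
scores = (rng.random(200) < 0.7).astype(float)
print(bootstrap_score_ci(scores))
```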
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Regression-aware Inference with LLMs [52.764328080398805]
We show that standard inference strategies can be sub-optimal for common regression and scoring evaluation metrics.
We propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses.
arXiv Detail & Related papers (2024-03-07T03:24:34Z)
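As a concrete instance of the closed-form estimates the entry above refers to: squared error is minimized by the mean of the response distribution and absolute error by its median, both of which can be approximated from sampled responses. A minimal sketch under those standard assumptions (parsing numeric answers from raw LLM outputs is left out):

```python
import statistics

def regression_aware_predict(samples: list[float], metric: str) -> float:
    """Plug-in Bayes-optimal estimate from sampled numeric responses.

    Squared error is minimized by the mean of the predictive distribution,
    absolute error by its median; the samples approximate that distribution.
    """
    if metric == "squared_error":
        return statistics.fmean(samples)
    if metric == "absolute_error":
        return statistics.median(samples)
    raise ValueError(f"unsupported metric: {metric}")

# e.g., numeric ratings parsed from several sampled LLM outputs
samples = [3.0, 4.0, 4.0, 5.0, 4.0]
print(regression_aware_predict(samples, "squared_error"))   # 4.0
print(regression_aware_predict(samples, "absolute_error"))  # 4.0
```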
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces the sample budget by up to 7.9x, with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
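To illustrate the dynamic-sampling idea in this last entry: draw answers one at a time and stop once a posterior over answer frequencies says the current majority is unlikely to change. The Dirichlet Monte Carlo rule below is one simple variant consistent with that description; `sample_fn`, the thresholds, and the exact stopping form are illustrative assumptions, not the paper's criterion.

```python
from collections import Counter
import numpy as np

def majority_is_stable(counts: Counter, threshold: float = 0.95,
                       draws: int = 10000, seed: int = 0) -> bool:
    """Stop if P(current top answer is the true mode) >= threshold,
    under a flat Dirichlet posterior over answer probabilities.
    (A real implementation would also account for as-yet-unseen answers.)"""
    rng = np.random.default_rng(seed)
    labels = list(counts)
    alpha = np.array([counts[l] + 1.0 for l in labels])  # Dirichlet(1 + counts)
    samples = rng.dirichlet(alpha, size=draws)
    top = np.argmax(alpha)  # index of the current majority answer
    prob_top = np.mean(np.argmax(samples, axis=1) == top)
    return prob_top >= threshold

def adaptive_consistency(sample_fn, min_samples: int = 3,
                         max_samples: int = 40) -> str:
    """Sample answers until the majority vote is stable or budget runs out."""
    counts: Counter = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_fn()] += 1  # sample_fn: hypothetical LLM answer sampler
        if n >= min_samples and majority_is_stable(counts):
            break
    return counts.most_common(1)[0][0]
```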