Efficient Evaluation of LLM Performance with Statistical Guarantees
- URL: http://arxiv.org/abs/2601.20251v2
- Date: Thu, 29 Jan 2026 03:01:40 GMT
- Title: Efficient Evaluation of LLM Performance with Statistical Guarantees
- Authors: Skyler Wu, Yash Nair, Emmanuel J. Candès
- Abstract summary: We propose Factorized Active Querying (FAQ) for benchmarking large language models. FAQ adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy. FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites.
- Score: 11.703733256169214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference -- a finite-population extension of active inference (Zrnic & Candès, 2024) that enables direct question selection while preserving coverage. With negligible overhead cost, FAQ delivers up to $5\times$ effective sample size gains over strong baselines on two benchmark suites, across varying historical-data missingness levels: this means that it matches the CI width of uniform sampling while using up to $5\times$ fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.
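To make the "effective sample size" claim in the abstract concrete, the following is a minimal sketch of the uniform-sampling baseline only, not of FAQ itself. It assumes a normal-approximation confidence interval with a finite-population correction; the function names, the benchmark size `N`, the accuracy `p_hat`, and the budget are all hypothetical values chosen for illustration. It shows how much narrower a uniform-sampling CI becomes when the budget is multiplied by a given gain, which is the width a method with that gain would reach at the original budget.

```python
import math


def uniform_ci_halfwidth(p_hat, n, N, z=1.96):
    """Half-width of a normal-approximation CI for accuracy when n questions
    are sampled uniformly without replacement from a benchmark of N questions.
    (Illustrative baseline; this is an assumption, not the paper's estimator.)"""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    if N == 1:
        return 0.0
    fpc = (N - n) / (N - 1)  # finite-population correction
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n * fpc)


def equivalent_uniform_budget(n_queries, ess_gain):
    """Number of uniform queries whose CI width a method with the given
    effective-sample-size gain would match using n_queries queries."""
    return ess_gain * n_queries


# Hypothetical numbers, chosen only for illustration.
N = 10_000      # benchmark size
p_hat = 0.7     # estimated accuracy
budget = 200    # queries actually spent
gain = 5.0      # effective sample size gain

width_uniform = uniform_ci_halfwidth(p_hat, budget, N)
n_equiv = int(equivalent_uniform_budget(budget, gain))
width_with_gain = uniform_ci_halfwidth(p_hat, n_equiv, N)

print(f"uniform sampling, {budget} queries: +/- {width_uniform:.4f}")
print(f"width matched by a {gain:g}x gain ({n_equiv} uniform queries): +/- {width_with_gain:.4f}")
```

Because the half-width scales roughly as $1/\sqrt{n}$, a $5\times$ gain narrows the interval by a factor of about $\sqrt{5} \approx 2.2$ when the budget is small relative to the benchmark; the finite-population correction increases this factor slightly.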
Related papers
- Active Transfer Bagging: A New Approach for Accelerated Active Learning Acquisition of Data by Combined Transfer Learning and Bagging Based Models [0.0]
We introduce a new method for selecting the seed data set for active learning, Active-Transfer Bagging (ATBagging). ATBagging estimates the informativeness of candidate data points from a Bayesian interpretation of bagged ensemble models. We evaluate ATBagging on four real-world datasets covering both target-transfer and feature-shift scenarios.
arXiv Detail & Related papers (2026-02-02T18:15:50Z) - Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR [18.494852448006462]
Existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data.
arXiv Detail & Related papers (2026-01-30T05:41:55Z) - Audit Me If You Can: Query-Efficient Active Fairness Auditing of Black-Box LLMs [4.673176641454931]
Large Language Models (LLMs) exhibit systematic biases across demographic groups. We conceptualise auditing as uncertainty estimation over a target fairness metric. We introduce BAFA, the Bounded Active Fairness Auditor for query-efficient auditing of black-box LLMs.
arXiv Detail & Related papers (2026-01-06T15:22:23Z) - Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization [56.97588709890706]
LongMab-PO is a novel framework that generates high-quality and diverse responses for long-context modeling tasks. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs.
arXiv Detail & Related papers (2025-08-19T16:33:55Z) - Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation [110.610512800947]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. In RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality.
arXiv Detail & Related papers (2025-07-25T09:32:29Z) - Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost.
arXiv Detail & Related papers (2025-03-17T16:15:02Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [75.1351701045874]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs). We propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way of improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs.
We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs.
We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z) - Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model [14.98695074168234]
We propose a new method to detect machine-generated text, especially from large language models (LLMs).
We use a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency.
Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget.
arXiv Detail & Related papers (2023-05-26T04:23:10Z) - Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z) - Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.