Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
- URL: http://arxiv.org/abs/2510.10457v1
- Date: Sun, 12 Oct 2025 05:38:10 GMT
- Title: Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?
- Authors: Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang
- Abstract summary: EssenceBench is a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA). Our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. On the HellaSwag benchmark (10K samples), our method preserves the ranking of all models within a 5% shift using 25x fewer samples, and achieves 95% ranking preservation within a 5% shift using only 200x fewer samples.
- Score: 82.09573568241724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have grown correspondingly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency remains largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify several highly similar samples that can be eliminated. In addition, we frame benchmark compression as an optimization problem aimed at score reconstruction. Building on these findings, we propose EssenceBench, a coarse-to-fine framework built around an iterative Genetic Algorithm (GA) that combines the advantages of fitness-based subset search and attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method preserves the ranking of all models within a 5% shift using 25x fewer samples, and achieves 95% ranking preservation within a 5% shift using only 200x fewer samples.
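As a rough illustration of the fitness-based subset search the abstract describes, the sketch below runs a simple genetic algorithm over candidate subsets, scoring each candidate by how closely mean accuracy on the subset reconstructs each model's full-benchmark score. The correctness-matrix input, population size, crossover/mutation scheme, and all function names are illustrative assumptions; the paper's coarse-to-fine pipeline and attribution-based sample search are not reproduced here.

```python
# Minimal sketch of GA-based benchmark subset selection (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_error(scores, subset):
    """Fitness: gap between full-benchmark accuracy and accuracy on the subset.

    scores: (n_models, n_samples) 0/1 correctness matrix; subset: sample indices.
    """
    full = scores.mean(axis=1)             # true per-model benchmark score
    pred = scores[:, subset].mean(axis=1)  # score estimated from the subset
    return np.abs(full - pred).mean()      # lower is better

def genetic_subset_search(scores, k, pop_size=50, generations=200, mutation_rate=0.2):
    n_samples = scores.shape[1]
    # Start from a population of random k-sized subsets.
    population = [rng.choice(n_samples, size=k, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [reconstruction_error(scores, s) for s in population]
        order = np.argsort(fitness)                      # ascending error
        elite = [population[i] for i in order[: pop_size // 2]]
        children = list(elite)
        while len(children) < pop_size:
            a, b = rng.choice(len(elite), size=2, replace=False)
            pool = np.union1d(elite[a], elite[b])        # crossover: union of parents
            child = rng.choice(pool, size=k, replace=False)
            if rng.random() < mutation_rate:             # mutation: swap one sample
                outside = np.setdiff1d(np.arange(n_samples), child)
                child[rng.integers(k)] = rng.choice(outside)
            children.append(child)
        population = children
    return min(population, key=lambda s: reconstruction_error(scores, s))

# Toy usage: 20 models on 1,000 samples, compressed to 40 samples (25x fewer).
scores = (rng.random((20, 1000)) < rng.random((20, 1))).astype(float)
subset = genetic_subset_search(scores, k=40)
print(reconstruction_error(scores, subset))
```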
Related papers
- Not All Candidates are Created Equal: A Heterogeneity-Aware Approach to Pre-ranking in Recommender Systems [11.849498011182066]
Heterogeneity-Aware Adaptive Pre-ranking (HAP) is a unified framework that mitigates gradient conflicts through conflict-sensitive sampling. HAP has been deployed in the Toutiao production system for 9 months, yielding up to a 0.4% improvement in user app usage duration.
arXiv Detail & Related papers (2026-03-04T06:27:47Z) - Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples [57.67658635348395]
LASER's exhaustive, per-matrix search makes it impractical for rapid deployment. We show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks.
arXiv Detail & Related papers (2025-10-23T17:58:01Z) - How Benchmark Prediction from Fewer Data Misses the Mark [18.693874781163657]
Benchmark prediction aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. This paper systematically assesses the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks.
arXiv Detail & Related papers (2025-06-09T11:50:41Z) - Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation [6.4212082894269535]
We compare existing leakage detection techniques, namely permutation and n-gram-based methods. Our analysis shows that the n-gram method consistently achieves the highest F1-score. We create cleaned versions of MMLU and HellaSwag, and re-evaluate several LLMs.
arXiv Detail & Related papers (2025-05-30T06:37:39Z) - Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - Nearly Optimal Sample Complexity for Learning with Label Proportions [54.67830198790247]
We investigate Learning from Label Proportions (LLP), a partial-information setting where examples in a training set are grouped into bags. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal.
arXiv Detail & Related papers (2025-05-08T15:45:23Z) - Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z) - Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach ensures statistically aligned model rankings compared to full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Learning to Select Pivotal Samples for Meta Re-weighting [12.73177872962048]
We study how to learn to identify such a meta sample set from a large, imperfect training set, that is subsequently cleaned and used to optimize performance.
We propose two clustering methods within our learning framework: a Representation-based clustering method (RBC) and a Gradient-based clustering method (GBC).
arXiv Detail & Related papers (2023-02-09T03:04:40Z) - Boosting Randomized Smoothing with Variance Reduced Classifiers [4.110108749051657]
We motivate why ensembles are a particularly suitable choice as base models for Randomized Smoothing (RS).
We empirically confirm this choice, obtaining state of the art results in multiple settings.
arXiv Detail & Related papers (2021-06-13T08:40:27Z)