Learning More from Less: Unlocking Internal Representations for Benchmark Compression
- URL: http://arxiv.org/abs/2602.00710v2
- Date: Tue, 03 Feb 2026 05:51:47 GMT
- Title: Learning More from Less: Unlocking Internal Representations for Benchmark Compression
- Authors: Yueqi Zhang, Jin Hu, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yiwei Li, Jiayi Shi, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
- Abstract summary: We introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy.
- Score: 37.69575776639016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prohibitive cost of evaluating Large Language Models (LLMs) necessitates efficient alternatives to full-scale benchmarking. Prevalent approaches address this by identifying a small coreset of items to approximate full-benchmark performance. However, existing methods must estimate a reliable item profile from response patterns across many source models, which becomes statistically unstable when the source pool is small. This dependency is particularly limiting for newly released benchmarks with minimal historical evaluation data. We argue that discrete correctness labels are a lossy view of the model's decision process and fail to capture information encoded in hidden states. To address this, we introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Using these subsets for performance extrapolation, REPCORE achieves precise estimation accuracy with as few as ten source models. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis further indicates that the aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.
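To make the pipeline the abstract describes concrete, here is a minimal sketch on synthetic data of its three stages: align heterogeneous hidden states into one latent space, pick a weighted representative coreset, and extrapolate full-benchmark accuracy from coreset correctness. The PCA-plus-Procrustes alignment, the k-means medoid selection, and every name and dimension below are stand-in assumptions for illustration; the paper's actual alignment objective and extrapolation method are not reproduced here.

```python
# Minimal sketch of a REPCORE-style pipeline on synthetic data. The alignment
# and selection steps are illustrative stand-ins, not the paper's procedure.
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_items, n_models, latent_dim, coreset_size = 500, 10, 32, 50

# Per-item hidden states from each source model; hidden sizes differ
# across models (heterogeneous representations).
dims = rng.choice([768, 1024, 2048], size=n_models)
hidden = [rng.normal(size=(n_items, d)) for d in dims]

# Step 1 (assumed alignment): reduce each model's states to a shared
# dimensionality, z-score, then rotate each space onto a reference model
# with orthogonal Procrustes to obtain a unified latent space.
reduced = []
for H in hidden:
    Z = PCA(n_components=latent_dim).fit_transform(H)
    reduced.append((Z - Z.mean(0)) / (Z.std(0) + 1e-8))
base = reduced[0]
aligned = [base] + [Z @ orthogonal_procrustes(Z, base)[0] for Z in reduced[1:]]
item_embed = np.mean(aligned, axis=0)              # (n_items, latent_dim)

# Step 2: representative coreset = the medoid of each k-means cluster,
# weighted by the fraction of the benchmark its cluster covers.
km = KMeans(n_clusters=coreset_size, n_init=10, random_state=0).fit(item_embed)
medoids = np.array([
    np.flatnonzero(km.labels_ == c)[
        np.linalg.norm(item_embed[km.labels_ == c] - km.cluster_centers_[c],
                       axis=1).argmin()]
    for c in range(coreset_size)
])
weights = np.bincount(km.labels_, minlength=coreset_size) / n_items

# Step 3: extrapolate full-benchmark accuracy from coreset correctness.
correct = rng.random((n_models, n_items)) < 0.6    # stand-in 0/1 labels
true_acc = correct.mean(axis=1)
est_acc = correct[:, medoids].astype(float) @ weights

# Report the two quantities the abstract evaluates: ranking correlation
# across models and per-model estimation error.
rho, _ = spearmanr(true_acc, est_acc)
print(f"Spearman rho = {rho:.3f}, "
      f"mean abs. error = {np.abs(true_acc - est_acc).mean():.3f}")
```

In the paper the alignment is presumably learned rather than this closed-form Procrustes stand-in, and the hidden states would carry real signal rather than noise; the sketch only shows why item embeddings in a shared space let a small, cluster-weighted coreset approximate full-benchmark accuracy even with few source models.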
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains [7.845652284569666]
Miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains. We introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs.
arXiv Detail & Related papers (2026-01-13T01:55:48Z) - Uncovering Competency Gaps in Large Language Models and Their Benchmarks [11.572508874955659]
We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. We found that models consistently underperformed on concepts that stand in contrast to sycophantic behaviors. Our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores.
arXiv Detail & Related papers (2025-12-06T17:39:47Z) - Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings [23.9553588103042]
We propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves. We show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation.
arXiv Detail & Related papers (2025-10-30T11:28:58Z) - Model Correlation Detection via Random Selection Probing [62.093777777813756]
Existing similarity-based methods require access to model parameters or produce scores without thresholds. We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test. RSP produces rigorous p-values that quantify evidence of correlation.
arXiv Detail & Related papers (2025-09-29T01:40:26Z) - The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs [3.9977256267361754]
We present Nazonazo, a cost-effective benchmark built from Japanese children's riddles to test insight-based reasoning. No model except GPT-5 is comparable to human performance, which reaches a mean accuracy of 52.9%.
arXiv Detail & Related papers (2025-09-18T07:50:04Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. We further elaborate the robustness metric: a model is judged robust if its performance is consistently accurate across whole cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.