SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
- URL: http://arxiv.org/abs/2602.07909v1
- Date: Sun, 08 Feb 2026 11:12:45 GMT
- Title: SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
- Authors: Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang
- Abstract summary: As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. We propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection.
- Score: 64.95852289011385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's $\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
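As a rough illustration of the core idea, the sketch below learns to predict a model's full-benchmark accuracy from its scores on a small anchor set by gradient descent over an MLP. This is not the authors' implementation; the matrix shapes, MLP size, and function names are assumptions, and the iterative anchor-refinement step (driven by the importance scores) is omitted.

```python
# Minimal sketch, assuming a binary model x item correctness matrix Y.
import torch

def fit_anchor_weights(Y, anchor_idx, epochs=500, lr=1e-2):
    """Y: (n_models, n_items) 0/1 correctness matrix (torch.float32)."""
    target = Y.mean(dim=1, keepdim=True)          # true benchmark accuracy
    X = Y[:, anchor_idx]                          # scores on anchor items only
    mlp = torch.nn.Sequential(                    # small MLP regressor (assumed size)
        torch.nn.Linear(len(anchor_idx), 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 1),
    )
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(mlp(X), target)
        loss.backward()                           # gradient descent on anchor weights
        opt.step()
    return mlp

# Usage: estimate a new model's benchmark score from its anchor answers only:
# estimator = fit_anchor_weights(Y_train, anchors)
# est = estimator(y_new[anchors])
```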
Related papers
- K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge [51.93484138861584]
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. We propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs.
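A toy sketch of the "corrected judge" idea: weight a VLM judge's verdict by its estimated reliability before feeding it into a rating update. The real K-Sort Eval pipeline is more involved; `reliability` (judge agreement with humans) and the Elo-style update are assumptions.

```python
def corrected_win_prob(judge_says_a_wins: bool, reliability: float) -> float:
    """Posterior P(A truly wins) given a judge verdict, uniform 0.5 prior.
    With a uniform prior the posterior equals the judge's reliability."""
    return reliability if judge_says_a_wins else 1.0 - reliability

def rank_update(score_a: float, score_b: float, p_a_wins: float, k: float = 32.0):
    """Elo-style update driven by the corrected win probability."""
    expected_a = 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400.0))
    delta = k * (p_a_wins - expected_a)
    return score_a + delta, score_b - delta
```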
arXiv Detail & Related papers (2026-02-10T05:07:46Z) - Autocorrelated Optimize-via-Estimate: Predict-then-Optimize versus Finite-sample Optimal [2.0228793142608588]
Models that directly optimize for out-of-sample performance in the finite-sample regime have emerged as a promising alternative to traditional estimate-then-optimize approaches. We compare their performance in the context of autocorrelated uncertainties, specifically under a vector autoregressive moving average (VARMA(p,q)) process.
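For reference, the standard VARMA(p,q) form the blurb refers to (a fact about the model class, not taken from this paper's notation):

```latex
% k-dimensional VARMA(p,q): AR coefficient matrices A_i, MA matrices B_j,
% and Gaussian white-noise innovations eps_t.
y_t = c + \sum_{i=1}^{p} A_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} B_j\, \varepsilon_{t-j},
\qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma).
```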
arXiv Detail & Related papers (2026-02-02T09:49:51Z) - Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings [23.9553588103042]
We propose an item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves. We show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation.
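One common way to realize item-centric selection is to cluster item embeddings and keep the item nearest each centroid. The sketch below uses generic embeddings as a stand-in for Scales++'s cognitive-scales embeddings, so it only illustrates the shape of the approach.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_subset(item_embs: np.ndarray, k: int) -> list[int]:
    """Pick k representative benchmark items by clustering item embeddings."""
    km = KMeans(n_clusters=k, n_init=10).fit(item_embs)
    chosen = []
    for center in km.cluster_centers_:
        # keep the real item closest to each cluster center (a medoid-like pick)
        chosen.append(int(np.linalg.norm(item_embs - center, axis=1).argmin()))
    return chosen
```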
arXiv Detail & Related papers (2025-10-30T11:28:58Z) - carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks [61.79411281702448]
carps is a benchmark framework for Comprehensive Automated Research Performance Studies. We focus on the four most important HPO task types: blackbox, multi-fidelity, multi-objective, and multi-fidelity-multi-objective. With 3,336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date.
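The cross-product structure of such a study ("N optimizers on M benchmarks") looks roughly like the loop below. This is a hypothetical shape using a generic ask/tell optimizer interface, not carps' actual API.

```python
def run_study(optimizers: dict, tasks: dict, budget: int = 100) -> dict:
    """Run every optimizer on every task and record the best cost found."""
    results = {}
    for opt_name, make_opt in optimizers.items():
        for task_name, task in tasks.items():
            opt = make_opt(task.search_space)     # fresh optimizer per task
            best = float("inf")
            for _ in range(budget):               # blackbox HPO loop
                cfg = opt.ask()                   # propose a configuration
                cost = task.evaluate(cfg)         # query the benchmark
                opt.tell(cfg, cost)               # report the observation
                best = min(best, cost)
            results[(opt_name, task_name)] = best
    return results
```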
arXiv Detail & Related papers (2025-06-06T15:01:39Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their task success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly injects feedback extracted from different forms of critiques.
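The sample-evaluate-feedback structure can be sketched as a short loop. `generate`, `evaluate`, and `critique` are placeholder hooks standing in for the three components the blurb names, not the paper's API.

```python
def iterative_agent_decoding(task, generate, evaluate, critique, rounds: int = 4):
    """Repeatedly sample a candidate, score it, and feed critique back in."""
    feedback = ""
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidate = generate(task, feedback)   # sample conditioned on feedback
        score = evaluate(task, candidate)      # verifier / reward signal
        if score > best_score:
            best, best_score = candidate, score
        feedback = critique(task, candidate)   # textual feedback for next round
    return best
```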
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - $\phi$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation [22.607133083903125]
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. We frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain a globally optimal step estimation. Experiments show $\phi$-Decoding outperforms strong baselines in both performance and efficiency.
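A minimal sketch of foresight sampling: score each candidate next step by rolling out a short simulated future and keep the highest-valued one. All hooks (`propose`, `rollout`, `value`) are assumed placeholders, not the paper's interface.

```python
def foresight_step(state, propose, rollout, value, k: int = 8, horizon: int = 3):
    """Pick the next reasoning step whose simulated future scores best."""
    candidates = [propose(state) for _ in range(k)]       # k candidate steps
    scores = [value(rollout(state, c, horizon))           # simulate a short future,
              for c in candidates]                        # then score that future
    return candidates[max(range(k), key=scores.__getitem__)]
```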
arXiv Detail & Related papers (2025-03-17T15:38:33Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
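The titular point, scoring with multiple generations per item rather than a single one, amounts to reporting the mean and spread over k sampled runs. A minimal sketch, with `model` and `grade` as assumed stubs:

```python
import statistics

def multi_gen_accuracy(items, model, grade, k: int = 5, temperature: float = 0.7):
    """Mean accuracy and stdev across k stochastic generations per benchmark."""
    per_run = []
    for _ in range(k):
        correct = sum(grade(item, model(item.prompt, temperature)) for item in items)
        per_run.append(correct / len(items))
    return statistics.mean(per_run), statistics.stdev(per_run)
```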
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model alone, while achieving significantly better accuracy than parallel decoding methods on average.
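In outline, the controlled bias can be realized by accepting a cheap draft continuation whenever a reward model scores it highly, and falling back to the target model otherwise. Simple thresholding is one possible acceptance rule; the hooks below are assumptions, not RSD's exact mechanism.

```python
def rsd_step(prefix, draft_model, target_model, reward, tau: float = 0.5):
    """One reward-guided speculative step: keep high-reward drafts."""
    draft = draft_model(prefix)             # cheap speculative continuation
    if reward(prefix, draft) >= tau:        # controlled bias toward high reward
        return draft, "draft"
    return target_model(prefix), "target"   # otherwise pay for the big model
```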
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks. We propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
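A schematic reading of the joint refinement: keep only confident pseudo-labels as demonstrations, then pick the candidate prompt that agrees with them best. This is one possible rendering of the idea, not PAPO's algorithm; `predict_with_conf` and `predict` are hypothetical model hooks.

```python
def papo_round(prompts, unlabeled, model, threshold: float = 0.9):
    """One round of joint prompt / pseudo-supervision refinement (sketch)."""
    demos = []
    for x in unlabeled:                                  # build pseudo-supervision
        y, conf = model.predict_with_conf(prompts[0], x)
        if conf >= threshold:                            # keep confident labels only
            demos.append((x, y))
    def agreement(p):                                    # fit of a prompt to demos
        return sum(model.predict(p, x, demos) == y for x, y in demos)
    return max(prompts, key=agreement), demos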
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
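For contrast, here is the classic RANSAC loop; Consensus-Adaptive RANSAC can be read as replacing the fixed inlier threshold below with a learned attention module over the batch of point-to-model residuals. The loop itself is standard, not this paper's method.

```python
import random

def ransac(points, fit, residual, n_min: int, iters: int = 1000, thresh: float = 1.0):
    """Vanilla RANSAC: sample minimal sets, keep the largest consensus set."""
    best_model, best_inliers = None, []
    for _ in range(iters):
        sample = random.sample(points, n_min)    # minimal sample
        model = fit(sample)
        inliers = [p for p in points if residual(model, p) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    # refit on the best consensus set, if any was found
    return fit(best_inliers) if best_inliers else best_model
```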
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - IOHanalyzer: Detailed Performance Analyses for Iterative Optimization Heuristics [3.967483941966979]
IOHanalyzer is a new user-friendly tool for the analysis, comparison, and visualization of performance data of IOHs.
IOHanalyzer provides detailed statistics about fixed-target running times and about fixed-budget performance of the benchmarked algorithms.
IOHanalyzer can directly process performance data from the main benchmarking platforms.
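The two views named above can be computed from a single run trajectory of (evaluations, best-so-far) pairs, as in this minimal sketch (minimization assumed; the function names are illustrative, not IOHanalyzer's API):

```python
def fixed_target(traj, target):
    """Evaluations needed to first reach a quality target, or None."""
    return next((ev for ev, f in traj if f <= target), None)

def fixed_budget(traj, budget):
    """Best quality achieved within a given evaluation budget, or None."""
    vals = [f for ev, f in traj if ev <= budget]
    return min(vals) if vals else None
```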
arXiv Detail & Related papers (2020-07-08T08:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.