EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization
- URL: http://arxiv.org/abs/2508.09662v1
- Date: Wed, 13 Aug 2025 09:48:23 GMT
- Title: EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization
- Authors: Yaoning Wang, Jiahao Ying, Yixin Cao, Yubo Ma, Yugang Jiang
- Abstract summary: EffiEval is a training-free approach for efficient benchmarking that addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, fairness, and generalizability. EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data.
- Score: 48.27039405295434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval, a training-free approach for efficient benchmarking that effectively addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, by ensuring comprehensive coverage of model capabilities; fairness, by remaining independent of model performance during sample selection to avoid bias; and generalizability, by enabling flexible transfer across datasets and model families without reliance on large-scale evaluation data. Unlike traditional methods that rely on absolute performance or require extensive evaluation data, our approach adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Extensive experiments on multiple public benchmarks and diverse LLMs demonstrate that EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data. Furthermore, our method is flexible and scalable in size, allowing users to balance evaluation efficiency and representativeness according to specific needs. Overall, EffiEval provides a practical and generalizable solution for reliable, fair, and efficient evaluation in the era of LLMs.
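The abstract describes the selection step only at a high level, so the following is a minimal, hypothetical sketch of what coverage-maximization-style subset selection could look like: each benchmark sample is assumed to be mapped to a set of "capability units" (a stand-in for the MUI-based signal, whose exact definition is not given in the abstract), and a greedy loop picks samples with the largest marginal coverage until a budget is reached. The function names and data layout are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch (not EffiEval's released implementation): greedy selection
# of a representative evaluation subset by maximizing capability coverage.
# Assumption: each sample is already mapped to a set of "capability units"
# (hypothetical stand-in for the MUI-derived signal described in the paper).

from typing import Dict, Hashable, List, Set


def select_coverage_subset(
    sample_capabilities: Dict[Hashable, Set[int]],
    budget: int,
) -> List[Hashable]:
    """Greedily pick up to `budget` samples whose union of capability units is largest."""
    selected: List[Hashable] = []
    covered: Set[int] = set()
    remaining = dict(sample_capabilities)

    while remaining and len(selected) < budget:
        # Choose the sample that adds the most not-yet-covered capability units.
        best_id, best_gain = None, -1
        for sample_id, units in remaining.items():
            gain = len(units - covered)
            if gain > best_gain:
                best_id, best_gain = sample_id, gain
        if best_gain <= 0:
            break  # No candidate adds new coverage; stop early.
        selected.append(best_id)
        covered |= remaining.pop(best_id)

    return selected


if __name__ == "__main__":
    # Toy example: 4 samples exercising overlapping capability units.
    caps = {"q1": {1, 2}, "q2": {2, 3, 4}, "q3": {5}, "q4": {1, 5}}
    print(select_coverage_subset(caps, budget=2))  # -> ['q2', 'q4']
```

A greedy choice like this is the standard approximation strategy for maximum-coverage objectives; the actual EffiEval procedure may differ in how capability coverage is measured, weighted, and scaled to the user's chosen subset size.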
Related papers
- CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria [48.70940362676624]
We propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks.
arXiv Detail & Related papers (2026-01-28T07:46:13Z)
- The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data [25.926467401802046]
Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. We propose a framework for evaluating synthetic data from two dimensions: quality and trustworthiness.
arXiv Detail & Related papers (2026-01-25T06:40:25Z)
- Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees [36.407171992845456]
We propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data.
arXiv Detail & Related papers (2025-05-24T11:53:29Z)
- Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis [36.689210473887904]
We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models. We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results). Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
arXiv Detail & Related papers (2025-03-18T15:40:18Z)
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark, DREB, that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z)
- A Distributed Collaborative Retrieval Framework Excelling in All Queries and Corpora based on Zero-shot Rank-Oriented Automatic Evaluation [46.33857318525812]
We propose a novel Distributed Collaborative Retrieval Framework (DCRF). It integrates various retrieval models into a unified system and dynamically selects the optimal results for each user's query. It can achieve performance comparable to effective listwise methods like RankGPT and ListT5.
arXiv Detail & Related papers (2024-12-16T14:55:57Z) - CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking.<n>We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem.<n>We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z) - MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures [28.130008435669865]
We introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize evaluations across diverse input and output modalities.
We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions.
arXiv Detail & Related papers (2024-10-17T16:52:28Z) - FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models [36.273451767886726]
FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies.
The framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules, enhance the fairness of the evaluation outcomes.
arXiv Detail & Related papers (2024-04-09T04:17:51Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation)
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.