Related papers: Active Evaluation Acquisition for Efficient LLM Benchmarking

Active Evaluation Acquisition for Efficient LLM Benchmarking

URL: http://arxiv.org/abs/2410.05952v1
Date: Tue, 8 Oct 2024 12:08:46 GMT
Title: Active Evaluation Acquisition for Efficient LLM Benchmarking
Authors: Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, Graham Horwood,
Abstract summary: We investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required.
Score: 18.85604491151409
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) become increasingly versatile, numerous large scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods.

Related papers

Efficient Evaluation of Large Language Models via Collaborative Filtering [25.734508624520164]
Large Language Models (LLMs) have been proposed to measure and compare the capabilities of different LLMs. evaluating LLMs is costly due to the large number of test instances and their slow inference speed. We propose a two-stage method to efficiently estimate a model's real performance on a given benchmark.
arXiv Detail & Related papers (2025-04-05T07:46:30Z)
PanguIR Technical Report for NTCIR-18 AEOLLM Task [12.061652026366591]
Large language models (LLMs) are increasingly critical and challenging to evaluate. Manual evaluation, while comprehensive, is often costly and resource-intensive. automatic evaluation offers greater scalability but is constrained by the limitations of its evaluation criteria.
arXiv Detail & Related papers (2025-03-04T07:40:02Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models [0.29687381456164]
VARCO Arena is a novel, cost-effective, and robust benchmarking approach for large language models. Our results demonstrate that VARCO Arena not only produces reliable LLM rankings but also provides a scalable, adaptable solution for qualitative evaluation.
arXiv Detail & Related papers (2024-11-02T15:23:28Z)
Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance. Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks. Our method also exhibits a improvement in LC Win-Rate in AlpacaEval2 leaderboard up to a 12% when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
ReIFE: Re-evaluating Instruction-Following Evaluation [105.75525154888655]
We present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 proposed evaluation protocols. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness.
arXiv Detail & Related papers (2024-10-09T17:14:50Z)
AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation. We show AIME outperforming baseline methods in code generation tasks, with up to $62%$ higher error detection rate and up to $16%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z)
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation [5.653106385738822]
Polyrating is an expressive and flexible rating system based on a maximum posteriori estimation. It can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. It can reduce the cost of human evaluations by up to $41%$ for new models and up to $77%$ for new tasks.
arXiv Detail & Related papers (2024-09-01T11:24:54Z)
Automated Off-Policy Estimator Selection via Supervised Learning [7.476028372444458]
Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies with data collected by another one. To solve the OPE problem, we resort to estimators, which aim to estimate in the most accurate way possible the performance that the counterfactual policies would have had if they were deployed in place of the logging policy. We propose an automated data-driven OPE estimator selection method based on supervised learning.
arXiv Detail & Related papers (2024-06-26T02:34:48Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models [28.441725610692714]
We propose a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs) We design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call. We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods.
arXiv Detail & Related papers (2023-05-23T05:57:09Z)
Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-Policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy. We propose a doubly-robust inference procedure for quantile OPE in sequential decision making. We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. We find the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one.
arXiv Detail & Related papers (2020-10-21T13:43:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.