Reliable and Efficient Amortized Model-based Evaluation
- URL: http://arxiv.org/abs/2503.13335v1
- Date: Mon, 17 Mar 2025 16:15:02 GMT
- Title: Reliable and Efficient Amortized Model-based Evaluation
- Authors: Sang Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo
- Abstract summary: The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach often renders an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the benchmark subset. We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost.
- Score: 57.6469531082784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Comprehensive evaluation of language models (LMs) during both development and deployment is necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnosis) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluation impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. Unfortunately, this approach often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the chosen subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by carefully controlling for question difficulty; unfortunately, question difficulty is expensive to estimate. We therefore train a model that predicts question difficulty from its content, enabling reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve evaluation efficiency by training a question generator conditioned on a difficulty level. This question generator is essential for adaptive testing, where, instead of using a random subset of benchmark questions, informative questions are chosen adaptively based on the current estimate of LM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient than current common practice.
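To make the measurement model concrete, the sketch below illustrates the kind of machinery the abstract describes: a two-parameter logistic (2PL) IRT model that estimates a latent ability from graded responses given per-question difficulty, plus Fisher-information-based item selection as used in adaptive testing. This is a generic illustration, not the authors' implementation; the item parameters, learning rate, and response pattern are assumed for the example, and in the paper the difficulties would come from a predictor trained on question content rather than being hand-specified.

```python
# Minimal sketch (not the paper's code): 2PL item response theory (IRT).
# Given item difficulties b_j and discriminations a_j, the model's latent
# ability theta is estimated from its graded responses, and the next question
# is chosen adaptively by maximizing Fisher information at the current
# ability estimate. All names and numbers here are illustrative assumptions.
import numpy as np

def p_correct(theta, a, b):
    """2PL response probability: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, n_steps=100, lr=0.1):
    """Maximum-likelihood estimate of ability via gradient ascent."""
    theta = 0.0
    for _ in range(n_steps):
        p = p_correct(theta, a, b)
        grad = np.sum(a * (responses - p))   # d log-likelihood / d theta
        theta += lr * grad
    return theta

def next_item(theta, a, b, asked):
    """Adaptive testing: pick the unasked item with maximal Fisher information."""
    p = p_correct(theta, a, b)
    info = a**2 * p * (1.0 - p)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Toy usage: 5 items with assumed parameters and a hypothetical LM response pattern.
a = np.array([1.0, 0.8, 1.2, 1.5, 0.9])      # discriminations (assumed)
b = np.array([-1.0, 0.0, 0.5, 1.5, 2.0])     # difficulties (assumed)
responses = np.array([1, 1, 1, 0, 0])        # 1 = correct, 0 = incorrect
theta_hat = estimate_ability(responses, a, b)
print(f"estimated ability: {theta_hat:.2f}")
print(f"most informative next item: {next_item(theta_hat, a, b, asked={0, 1})}")
```

In a full adaptive loop these two steps alternate: re-estimate ability after each response, then query the next most informative question, which is what allows a small, targeted subset of questions to stand in for a full benchmark pass.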
Related papers
- Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta [2.1249213103048414]
We introduce the EQUATOR Evaluator, which combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment. Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations. Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards.
arXiv Detail & Related papers (2024-12-31T03:56:17Z) - TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles [2.8839090723566296]
TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform.
TurtleBench includes 1,532 user guesses along with their correctness after annotation.
We thoroughly evaluated nine of the most advanced Large Language Models available today.
arXiv Detail & Related papers (2024-10-07T17:58:47Z) - On Speeding Up Language Model Evaluation [48.51924035873411]
We propose an adaptive approach to explore this space. We lean on multi-armed bandits to sequentially identify the next (method, validation sample) pair to evaluate; a generic bandit sketch appears after this related-papers list. We show that it can identify the top-performing method using only 5-15% of the typical resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on GSM8K, AQuA, and StrategyQA, respectively.
arXiv Detail & Related papers (2023-05-01T02:37:59Z)
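The "On Speeding Up Language Model Evaluation" entry above leans on multi-armed bandits to choose which (method, validation sample) pair to evaluate next. The sketch below shows a generic UCB-style selector for that setting; it is an illustration under assumed reward semantics (per-sample scores in [0, 1]) with a placeholder evaluate() stub, not that paper's exact algorithm.

```python
# Generic UCB1 sketch for adaptive evaluation: treat each candidate method as an
# arm, and each evaluation of one validation sample as a (costly) reward draw.
# The evaluate() stub and reward range are assumptions for illustration only.
import math
import random

def evaluate(method, sample):
    """Stub: run `method` on `sample` and return a score in [0, 1]."""
    return random.random()  # placeholder for a real, expensive evaluation

def ucb_evaluation(methods, samples, budget, c=1.0):
    counts = {m: 0 for m in methods}
    means = {m: 0.0 for m in methods}
    unseen = {m: list(samples) for m in methods}
    for t in range(1, budget + 1):
        def ucb(m):
            if counts[m] == 0:
                return float("inf")        # evaluate every method at least once
            return means[m] + c * math.sqrt(math.log(t) / counts[m])
        m = max(methods, key=ucb)          # pick the method with the highest bound
        sample = unseen[m].pop(random.randrange(len(unseen[m])))
        r = evaluate(m, sample)
        counts[m] += 1
        means[m] += (r - means[m]) / counts[m]  # running mean update
    return max(methods, key=lambda m: means[m])

best = ucb_evaluation(methods=["prompt_A", "prompt_B", "prompt_C"],
                      samples=list(range(200)), budget=60)
print("estimated best method:", best)
```

The same pattern extends to richer action spaces (e.g., jointly scoring methods and samples), though the cited paper's estimator and stopping rule may differ from this plain UCB1 variant.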