Don't Make Your LLM an Evaluation Benchmark Cheater
- URL: http://arxiv.org/abs/2311.01964v1
- Date: Fri, 3 Nov 2023 14:59:54 GMT
- Title: Don't Make Your LLM an Evaluation Benchmark Cheater
- Authors: Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu
Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han
- Abstract summary: Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
- Score: 142.24553056600627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have greatly advanced the frontiers of
artificial intelligence, attaining remarkable improvements in model capacity. To
assess model performance, a typical approach is to construct evaluation
benchmarks that measure the abilities of LLMs in different aspects. Although a
number of high-quality benchmarks have been released, concerns about the
appropriate use of these benchmarks and the fair comparison of different models
are growing. Considering these concerns, in this paper we discuss the potential
risks and impact of inappropriately using evaluation benchmarks and
misleadingly interpreting the evaluation results. In particular, we focus on
one issue that leads to inappropriate evaluation, i.e., benchmark leakage: data
related to evaluation sets being used for model training. This phenomenon has
become more common because pre-training data is often prepared before model
evaluation. We conduct extensive experiments to study the effect of benchmark
leakage and find that it can dramatically inflate evaluation results, which
ultimately leads to an unreliable assessment of model performance. To improve
the use of existing evaluation benchmarks, we present several guidelines for
both LLM developers and benchmark maintainers. We hope this work draws
attention to the appropriate training and evaluation of LLMs.
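The benchmark leakage described above can be screened for before training. The sketch below is a minimal, hypothetical example of such a screen, not the paper's actual experimental setup: it flags pre-training documents that share long word n-grams with benchmark test items, a common contamination heuristic. The function names, the 13-gram threshold, and the toy data are all illustrative assumptions.

```python
# Minimal n-gram overlap screen for benchmark contamination.
# Hypothetical sketch only: tokenization, the n-gram length, and the toy data
# are illustrative assumptions, not the paper's experimental setup.
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(test_items: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that appears in any benchmark test item."""
    index: Set[Tuple[str, ...]] = set()
    for item in test_items:
        index |= ngrams(item, n)
    return index


def flag_contaminated(train_docs: Iterable[str],
                      benchmark_index: Set[Tuple[str, ...]],
                      n: int = 13) -> List[int]:
    """Indices of training documents sharing at least one benchmark n-gram,
    i.e. candidates for removal before pre-training."""
    return [i for i, doc in enumerate(train_docs)
            if ngrams(doc, n) & benchmark_index]


if __name__ == "__main__":
    # Toy stand-ins for a benchmark test split and a pre-training corpus.
    benchmark = ["The quick brown fox jumps over the lazy dog near the old river bank"]
    corpus = [
        "An unrelated paragraph about tokenizers and model training pipelines.",
        "The quick brown fox jumps over the lazy dog near the old river bank, "
        "copied verbatim into some web page.",
    ]
    index = build_benchmark_index(benchmark, n=13)
    print(flag_contaminated(corpus, index, n=13))  # -> [1]
```

Exact long-n-gram matching is a conservative screen that mainly catches verbatim copies; fuzzier matching (for example, normalized or embedding-based) trades precision for recall.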
Related papers
- Unveiling Context-Aware Criteria in Self-Assessing LLMs [28.156979106994537]
We propose a novel Self-Assessing LLM framework that integrates Context-Aware Criteria (SALC) with dynamic knowledge tailored to each evaluation instance.
Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks.
Our method also yields an improvement in LC Win-Rate of up to 12% on the AlpacaEval2 leaderboard when employed for preference data generation.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
- RMB: Comprehensively Benchmarking Reward Models in LLM Alignment [44.84304822376291]
Reward models (RMs) guide the alignment of large language models (LLMs) toward human preferences.
We propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios.
Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs.
arXiv Detail & Related papers (2024-10-13T16:06:54Z)
- Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability (a toy sketch of this computation-reliability trade-off appears after this list).
arXiv Detail & Related papers (2023-08-22T17:59:30Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain converted total scores for LLMs, including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
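Following up on the Efficient Benchmarking entry above: the computation-reliability trade-off it studies can be illustrated with a small simulation that ranks models on random subsets of a benchmark and checks how often the subset ranking agrees with the full-benchmark ranking. The sketch below is only a toy illustration of that idea under made-up accuracies and subset sizes, not the evaluation algorithm proposed in the paper.

```python
# Toy illustration of the computation-reliability trade-off behind efficient
# benchmarking: rank models on random subsets of a benchmark and check how
# often the subset reproduces the full-benchmark ranking. This is NOT the
# algorithm from the paper; models, accuracies, and subset sizes are made up.
import random
from statistics import mean

random.seed(0)
N = 1000  # size of the toy benchmark

# Per-example correctness (True/False) for three hypothetical models.
full_scores = {
    "model_a": [random.random() < 0.80 for _ in range(N)],
    "model_b": [random.random() < 0.75 for _ in range(N)],
    "model_c": [random.random() < 0.60 for _ in range(N)],
}


def ranking(scores_by_model):
    """Order model names from best to worst mean accuracy."""
    return sorted(scores_by_model,
                  key=lambda m: mean(scores_by_model[m]), reverse=True)


full_order = ranking(full_scores)

for k in (25, 100, 400):
    # Rank all models on the same random subset of k examples, repeatedly,
    # and measure how often the subset ranking matches the full ranking.
    trials, agree = 200, 0
    for _ in range(trials):
        idx = random.sample(range(N), k)
        sub = ranking({m: [s[i] for i in idx] for m, s in full_scores.items()})
        agree += (sub == full_order)
    print(f"subset size {k}: agreement with full ranking {agree / trials:.0%}")
```

Smaller subsets cut evaluation cost but reproduce the full ranking less reliably, which is the trade-off that efficient benchmark design choices aim to manage.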