Elo Uncovered: Robustness and Best Practices in Language Model
Evaluation
- URL: http://arxiv.org/abs/2311.17295v1
- Date: Wed, 29 Nov 2023 00:45:23 GMT
- Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee
- Abstract summary: We study two axioms that evaluation methods should adhere to: reliability and transitivity.
We show that these axioms are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs.
- Score: 9.452326973655447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Natural Language Processing (NLP), the Elo rating system, originally
designed for ranking players in dynamic games such as chess, is increasingly
being used to evaluate Large Language Models (LLMs) through "A vs B" paired
comparisons. However, while popular, the system's suitability for assessing
entities with constant skill levels, such as LLMs, remains relatively
unexplored. We study two fundamental axioms that evaluation methods should
adhere to: reliability and transitivity. We conduct extensive evaluation of Elo
behaviour, illustrating that individual Elo computations exhibit volatility and
delving into the impact of varying the Elo rating system's hyperparameters. We
show that these axioms are not always satisfied, raising questions about the
reliability of current comparative evaluations of LLMs. If the current use of
Elo scores is intended to substitute for the costly head-to-head comparison of
LLMs, it is crucial to ensure the ranking is as robust as possible. Guided by
the axioms, our findings offer concrete guidelines for enhancing the
reliability of LLM evaluation methods, suggesting a need for reassessment of
existing comparative approaches.
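To make the volatility concern concrete, here is a minimal sketch of the standard Elo update applied to the same set of "A vs B" outcomes in two different orders. It is not the authors' code: the K-factor of 16, the initial rating of 1400, the model names, and the match outcomes are all hypothetical values chosen for illustration.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=16.0, initial=1400.0):
    """Apply sequential Elo updates.

    matches: list of (model_a, model_b, score_a), where score_a is
    1.0 if A wins and 0.0 if B wins.
    k and initial are assumed hyperparameter values, not taken from the paper.
    """
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # A gains exactly what B loses, so the rating total is conserved.
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b - k * (score_a - e_a)
    return ratings


# Hypothetical outcomes: model A beats model B in 6 of 10 comparisons.
wins = [("A", "B", 1.0)] * 6
losses = [("A", "B", 0.0)] * 4

wins_first = run_elo(wins + losses)
losses_first = run_elo(losses + wins)

print("A's rating, wins first:  ", round(wins_first["A"], 2))
print("A's rating, losses first:", round(losses_first["A"], 2))
```

The two runs see identical outcomes, yet A's final rating differs because later matches weigh more heavily than earlier ones in a sequential update. Averaging over many random orderings of the same matches is one way to reduce this sensitivity.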
Related papers
- Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat [7.8905223445925055]
Pairwise ranking has emerged as a new method for evaluating human preferences for large language models (LLMs).
We explore the effectiveness of ranking systems for head-to-head comparisons of LLMs.
Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency.
arXiv Detail & Related papers (2024-11-19T20:16:26Z)
- The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators [31.520403357740317]
Large language models (LLMs) are increasingly used as evaluators for natural language generation tasks.
LLMs display biased preferences, such as favoring verbosity and authoritative tones.
We introduce PRePair, which integrates pointwise reasoning within a pairwise framework.
arXiv Detail & Related papers (2024-06-18T06:43:04Z)
- Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments [41.25558612970942]
We show that large language models (LLMs) exhibit preference biases and worrying sensitivity to prompt designs.
Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO.
arXiv Detail & Related papers (2024-06-17T09:48:53Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities in assessing the quality of generated natural language.
LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments.
We introduce Pairwise-preference Search (PairS), an uncertainty-guided search method that employs LLMs to conduct pairwise comparisons and efficiently ranks candidate texts.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.