Style Over Substance: Evaluation Biases for Large Language Models
- URL: http://arxiv.org/abs/2307.03025v3
- Date: Sun, 12 Nov 2023 06:25:32 GMT
- Title: Style Over Substance: Evaluation Biases for Large Language Models
- Authors: Minghao Wu, Alham Fikri Aji
- Abstract summary: This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs), when comparing outputs from different models.
Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
- Score: 17.13064447978519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) continue to advance, accurately and
comprehensively evaluating their performance becomes increasingly challenging.
Ranking the relative performance of LLMs based on Elo ratings derived from
human judgment is gaining popularity. However, the extent to which humans
and LLMs are capable evaluators remains uncertain. This study investigates the
behavior of crowd-sourced and expert annotators, as well as LLMs, when
comparing outputs from different models. To achieve this, we curate a dataset
of intentionally flawed machine-generated answers. Our findings reveal a
concerning bias in the evaluation process, as answers with factual errors are
rated more favorably than answers that are too short or contain grammatical
errors. To address this issue, we propose independently evaluating
machine-generated text across multiple dimensions, rather than merging all the
evaluation aspects into a single score. We instantiate this idea with the Elo
rating system, resulting in the Multi-Elo Rating System (MERS). Empirical
results from our study reveal that this proposed approach significantly
enhances the quality of LLM-based evaluations, particularly in terms of factual
accuracy. However, there is no significant improvement in crowd-sourced
evaluations, indicating the need for further investigation.
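To make the Multi-Elo idea concrete, here is a minimal sketch of per-dimension Elo updates from pairwise judgments. It assumes the standard Elo update rule with a fixed K-factor of 32 and an initial rating of 1000; the dimension names, model names, and function names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: keep an independent Elo rating per evaluation dimension and
# apply the standard Elo update to each dimension separately. The K-factor,
# initial rating, and dimension/model names are assumptions for illustration.
from collections import defaultdict

K = 32                   # assumed update step size
INITIAL_RATING = 1000.0  # assumed starting rating for every model

# ratings[dimension][model] -> current Elo score for that model on that dimension
ratings = defaultdict(lambda: defaultdict(lambda: INITIAL_RATING))

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of a player rated r_a against a player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(dimension: str, model_a: str, model_b: str, outcome_a: float) -> None:
    """Apply one pairwise judgment on a single dimension.

    outcome_a is 1.0 if model_a wins, 0.5 for a tie, and 0.0 if model_b wins.
    """
    r_a = ratings[dimension][model_a]
    r_b = ratings[dimension][model_b]
    e_a = expected_score(r_a, r_b)
    ratings[dimension][model_a] = r_a + K * (outcome_a - e_a)
    ratings[dimension][model_b] = r_b + K * ((1.0 - outcome_a) - (1.0 - e_a))

# Example: one judgment prefers model_A on accuracy but calls language a tie.
update("accuracy", "model_A", "model_B", 1.0)
update("language", "model_A", "model_B", 0.5)
```

Because each dimension keeps its own rating, a fluent but factually weak answer can gain on a language dimension while still losing on accuracy, which is the separation a single merged score cannot express.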
Related papers
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system.
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation [9.452326973655447]
We study two axioms that evaluation methods should adhere to: reliability and transitivity.
We show that these axioms are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs.
arXiv Detail & Related papers (2023-11-29T00:45:23Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Benchmarking Cognitive Biases in Large Language Models as Evaluators [16.845939677403287]
Large Language Models (LLMs) have been shown to be effective as automatic evaluators with simple prompting and in-context learning.
We evaluate the quality of ranking outputs by introducing the Cognitive Bias Benchmark for LLMs as Evaluators.
We find that LLMs are biased text quality evaluators, exhibiting strong indications of bias on our benchmark.
arXiv Detail & Related papers (2023-09-29T06:53:10Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)