Large Language Models are not Fair Evaluators
- URL: http://arxiv.org/abs/2305.17926v2
- Date: Wed, 30 Aug 2023 13:22:35 GMT
- Title: Large Language Models are not Fair Evaluators
- Authors: Peiyi Wang and Lei Li and Liang Chen and Zefan Cai and Dawei Zhu and
Binghuai Lin and Yunbo Cao and Qi Liu and Tianyu Liu and Zhifang Sui
- Abstract summary: We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
- Score: 60.27164804083752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we uncover a systematic bias in the evaluation paradigm of
adopting large language models~(LLMs), e.g., GPT-4, as a referee to score and
compare the quality of responses generated by candidate models. We find that
the quality ranking of candidate responses can be easily hacked by simply
altering their order of appearance in the context. This manipulation allows us
to skew the evaluation result, making one model appear considerably superior to
the other, e.g., Vicuna-13B could beat ChatGPT on 66 of the 80 tested queries
with ChatGPT as an evaluator. To address this issue, we propose a calibration
framework with three simple yet effective strategies: 1) Multiple Evidence
Calibration, which requires the evaluator model to generate multiple evaluation
evidence before assigning ratings; 2) Balanced Position Calibration, which
aggregates results across various orders to determine the final score; 3)
Human-in-the-Loop Calibration, which introduces a balanced position diversity
entropy to measure the difficulty of each example and seeks human assistance
when needed. We also manually annotate the "win/tie/lose" outcomes of responses
from ChatGPT and Vicuna-13B on the Vicuna Benchmark's question prompts, and
extensive experiments demonstrate that our approach successfully mitigates
evaluation bias, resulting in closer alignment with human judgments. We release
our code and human annotation at https://github.com/i-Eval/FairEval to
facilitate future research.
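The three calibration strategies compose into a single scoring loop: sample several evaluations per ordering (Multiple Evidence Calibration), average the scores across both orderings (Balanced Position Calibration), and escalate examples with unstable verdicts to a human (Human-in-the-Loop Calibration). The Python sketch below shows one way such a loop could look; the `Evaluator` interface, the per-round win/tie/lose bookkeeping, and the entropy-based difficulty measure are illustrative assumptions rather than the paper's exact formulation.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical judge interface: given (question, first_response, second_response),
# return (score_for_first, score_for_second) as assigned by the LLM evaluator.
Evaluator = Callable[[str, str, str], Tuple[float, float]]


def calibrated_scores(
    evaluate: Evaluator, question: str, resp_a: str, resp_b: str, k: int = 3
) -> Tuple[float, float, List[str]]:
    """Score a response pair in both presentation orders and average the results.

    Running k rounds per ordering loosely mirrors Multiple Evidence Calibration
    (the judge's reasoning varies across samples); averaging over the two
    orderings mirrors Balanced Position Calibration.
    """
    score_a, score_b, outcomes = 0.0, 0.0, []
    for _ in range(k):
        a1, b1 = evaluate(question, resp_a, resp_b)   # response A presented first
        b2, a2 = evaluate(question, resp_b, resp_a)   # response B presented first
        a_avg, b_avg = (a1 + a2) / 2, (b1 + b2) / 2
        score_a += a_avg
        score_b += b_avg
        outcomes.append("win" if a_avg > b_avg else "lose" if a_avg < b_avg else "tie")
    return score_a / k, score_b / k, outcomes


def diversity_entropy(outcomes: List[str]) -> float:
    """Entropy of the win/tie/lose distribution over the calibrated rounds.

    Used here as an assumed stand-in for the paper's balanced position
    diversity entropy: high entropy means the judge is inconsistent across
    orderings and samples, so the example is a candidate for human annotation.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In practice, examples whose entropy exceeds a threshold chosen on a small validation set would be routed to human annotators, while the remaining examples keep the averaged judge scores.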
Related papers
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
- Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
The LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models.
This paper focuses on a clean scenario in which inter-human agreement is high.
We identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency.
arXiv Detail & Related papers (2024-06-18T13:49:54Z)
- Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning requires a separate reward model (RM).
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models [32.843361525236965]
We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that preferences from ratings and rankings significantly disagree, about 60% of the time, for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
arXiv Detail & Related papers (2023-08-30T07:35:32Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Calibrate Before Use: Improving Few-Shot Performance of Language Models [68.17016463756474]
GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples.
We show that this type of few-shot learning can be unstable.
The choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art.
arXiv Detail & Related papers (2021-02-19T00:23:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.