Humans or LLMs as the Judge? A Study on Judgement Biases
- URL: http://arxiv.org/abs/2402.10669v3
- Date: Wed, 17 Apr 2024 09:56:26 GMT
- Title: Humans or LLMs as the Judge? A Study on Judgement Biases
- Authors: Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang
- Abstract summary: We propose a novel framework that requires no ground-truth annotations for investigating Fallacy Oversight Bias, Authority Bias, and Beauty Bias in LLM and human judges.
Results show that human and LLM judges are vulnerable to perturbations to varying degrees, and that even cutting-edge judges possess considerable biases.
We hope that our work can alert the community to the vulnerability of human- and LLM-as-a-judge to perturbations, as well as the urgency of developing robust evaluation systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adopting humans and large language models (LLMs) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLM judges, calling into question the reliability of the evaluation results. In this paper, we propose a novel framework that requires no ground-truth annotations for investigating Fallacy Oversight Bias, Authority Bias, and Beauty Bias in LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of human and LLM evaluations. Results show that human and LLM judges are vulnerable to perturbations to varying degrees, and that even cutting-edge judges possess considerable biases. We further exploit their weaknesses and conduct attacks on LLM judges. We hope that our work can alert the community to the vulnerability of human- and LLM-as-a-judge to perturbations, as well as the urgency of developing robust evaluation systems.
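To make the probing recipe in the abstract concrete, below is a minimal sketch, not the authors' released code, of how one might test an LLM judge for Authority Bias: append a fabricated citation to the weaker of two answers and measure how often the judge's verdict flips. The `call_llm` helper, the prompt wording, and the example citation are hypothetical stand-ins for whatever judge model and template are actually used.

```python
# Minimal sketch of an authority-bias probe for an LLM judge.
# NOTE: call_llm is a hypothetical placeholder; swap in your own API client.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the raw completion text."""
    raise NotImplementedError("plug in your preferred LLM client here")

JUDGE_PROMPT = (
    "You are an impartial judge. Given a question and two answers, reply with "
    "a single letter, 'A' or 'B', for the better answer.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\nVerdict:"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which answer is better; normalise the verdict to 'A'/'B'."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return "A" if verdict.strip().upper().startswith("A") else "B"

def add_fake_authority(answer: str) -> str:
    """Authority-bias perturbation: attach a plausible-looking, fabricated citation."""
    return answer + " (see Smith et al., 2021, Journal of Applied Reasoning)"

def authority_flip_rate(pairs: list[tuple[str, str, str]]) -> float:
    """pairs holds (question, weaker_answer, stronger_answer) triples.
    Returns the fraction of cases where the fake citation flips the verdict
    from the stronger answer to the perturbed weaker one."""
    flips = 0
    for question, weak, strong in pairs:
        before = judge(question, weak, strong)                     # expect 'B'
        after = judge(question, add_fake_authority(weak), strong)  # a biased judge may now say 'A'
        if before == "B" and after == "A":
            flips += 1
    return flips / len(pairs)
```

The same loop generalizes to the other perturbations studied in the paper, for example injecting a logical fallacy (Fallacy Oversight Bias) or decorative formatting (Beauty Bias), by swapping out `add_fake_authority`.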
Related papers
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments.
In the absence of a comparison against human data, this raises concerns about the validity of these evaluations.
We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z)
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
We study the performance of various large language models (LLMs) acting as judges.
We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs.
We find that both Llama-3 70B and GPT-4 Turbo align very well with humans, but in terms of ranking exam-taker models they are outperformed by both JudgeLM-7B and the lexical judge "Contains".
arXiv Detail & Related papers (2024-06-18T13:49:54Z)
- Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions [77.83767077859835]
We propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents.
In our experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges [54.007823006976516]
We propose to fine-tune Large Language Models (LLMs) as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks.
We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges.
We then analyze the key biases that arise when fine-tuning LLMs as judges, identifying position bias, knowledge bias, and format bias.
arXiv Detail & Related papers (2023-10-26T17:48:58Z)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases (a simple order-swap probe for position bias is sketched after this list).
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-Bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
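Several of the entries above, most explicitly the MT-Bench paper, study position bias: a judge favouring whichever answer is shown first. A minimal sketch of the standard order-swap probe, reusing the hypothetical `judge` helper from the earlier example, might look like this:

```python
def position_inconsistency_rate(pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of comparisons whose verdict changes when the answers swap slots.

    pairs holds (question, answer_1, answer_2). A position-free judge that
    prefers answer_1 returns 'A' in the forward run and 'B' in the swapped run;
    identical verdicts in both runs therefore indicate order-dependent judgements.
    """
    inconsistent = 0
    for question, answer_1, answer_2 in pairs:
        forward = judge(question, answer_1, answer_2)   # answer_1 presented as A
        swapped = judge(question, answer_2, answer_1)   # answer_1 presented as B
        if forward == swapped:
            inconsistent += 1
    return inconsistent / len(pairs)
```

A common mitigation is to query the judge in both orders and treat disagreements between the two runs as ties rather than wins.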
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.