Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- URL: http://arxiv.org/abs/2306.05685v4
- Date: Sun, 24 Dec 2023 02:01:34 GMT
- Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
Joseph E. Gonzalez, Ion Stoica
- Abstract summary: We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
- Score: 76.21004582932268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large language model (LLM) based chat assistants is challenging
due to their broad capabilities and the inadequacy of existing benchmarks in
measuring human preferences. To address this, we explore using strong LLMs as
judges to evaluate these models on more open-ended questions. We examine the
usage and limitations of LLM-as-a-judge, including position, verbosity, and
self-enhancement biases, as well as limited reasoning ability, and propose
solutions to mitigate some of them. We then verify the agreement between LLM
judges and human preferences by introducing two benchmarks: MT-bench, a
multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our
results reveal that strong LLM judges like GPT-4 can match both controlled and
crowdsourced human preferences well, achieving over 80% agreement, the same
level of agreement as between humans. Hence, LLM-as-a-judge is a scalable and
explainable way to approximate human preferences, which are otherwise very
expensive to obtain. Additionally, we show our benchmark and traditional
benchmarks complement each other by evaluating several variants of LLaMA and
Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with
human preferences are publicly available at
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
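To make the judging protocol above concrete, the sketch below shows a pairwise LLM-as-a-judge comparison that queries the judge twice with the answer order swapped, a simple mitigation for the position bias discussed in the abstract. The prompt wording and the `call_judge` callable are illustrative assumptions, not the released MT-bench templates.

```python
# Minimal sketch of pairwise LLM-as-a-judge with a position-bias check.
# `call_judge` is a placeholder for any chat-completion call (an assumption
# here), not part of the paper's released code.

PAIRWISE_TEMPLATE = """Please act as an impartial judge and compare the two
assistant responses to the user question below. Reply with "A", "B", or "tie".

[Question]
{question}

[Assistant A]
{answer_a}

[Assistant B]
{answer_b}

[Verdict]"""


def judge_pair(call_judge, question, answer_1, answer_2):
    """Query the judge twice with the answer order swapped; accept a verdict
    only if it is consistent under both orderings (mitigates position bias)."""
    first = call_judge(PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip()
    second = call_judge(PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip()

    # Map the swapped-order verdict back to the original labeling.
    swapped_back = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    return first if first == swapped_back else "tie"
```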
Related papers
- JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.
Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
- Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking [56.275521022148794]
Post-training methods claim superior alignment by virtue of better correspondence with human pairwise preferences.
We attempt to answer the question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not?
We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, not the preference optimization (PO) stage, has the greatest impact on alignment.
arXiv Detail & Related papers (2024-09-23T17:58:07Z)
- Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents.
In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system (an illustrative Elo update is sketched after this list).
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference [48.99117537559644]
We introduce Chatbot Arena, an open platform for evaluating Large Language Models (LLMs) based on human preferences.
Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing.
This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using.
arXiv Detail & Related papers (2024-03-07T01:22:38Z)
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges [54.007823006976516]
We propose to fine-tune Large Language Models (LLMs) as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks.
We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLM-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges.
We then analyze the key biases in fine-tuning LLMs as judges, namely position bias, knowledge bias, and format bias.
arXiv Detail & Related papers (2023-10-26T17:48:58Z)
- SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark [16.802854803128433]
We propose SuperCLUE, a comprehensive Chinese benchmark named after another popular Chinese LLM benchmark, CLUE.
SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single- and multi-turn dialogues (OPEN), and closed-ended questions with the same stems as the open-ended single-turn ones (CLOSE).
Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones.
arXiv Detail & Related papers (2023-07-27T17:24:09Z)
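As referenced in the MAD and Chatbot Arena entries above, pairwise battle outcomes can be aggregated into a global ranking with Elo updates. The sketch below is a generic Elo implementation under common defaults (K=32, initial rating 1000); it is not the specific rating procedure used by any of the papers listed.

```python
# Illustrative Elo aggregation of pairwise battle outcomes.
from collections import defaultdict


def compute_elo(battles, k=32, init=1000.0):
    """battles: iterable of (model_a, model_b, winner) tuples, where winner is
    "A", "B", or "tie". Returns a dict mapping model name to Elo rating."""
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


# Example with three hypothetical crowdsourced votes.
print(compute_elo([("vicuna-13b", "llama-13b", "A"),
                   ("gpt-4", "vicuna-13b", "A"),
                   ("gpt-4", "llama-13b", "tie")]))
```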
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.