Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- URL: http://arxiv.org/abs/2306.05685v4
- Date: Sun, 24 Dec 2023 02:01:34 GMT
- Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
Joseph E. Gonzalez, Ion Stoica
- Abstract summary: We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform.
- Score: 76.21004582932268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large language model (LLM) based chat assistants is challenging
due to their broad capabilities and the inadequacy of existing benchmarks in
measuring human preferences. To address this, we explore using strong LLMs as
judges to evaluate these models on more open-ended questions. We examine the
usage and limitations of LLM-as-a-judge, including position, verbosity, and
self-enhancement biases, as well as limited reasoning ability, and propose
solutions to mitigate some of them. We then verify the agreement between LLM
judges and human preferences by introducing two benchmarks: MT-bench, a
multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our
results reveal that strong LLM judges like GPT-4 can match both controlled and
crowdsourced human preferences well, achieving over 80% agreement, the same
level of agreement as between humans. Hence, LLM-as-a-judge is a scalable and
explainable way to approximate human preferences, which are otherwise very
expensive to obtain. Additionally, we show our benchmark and traditional
benchmarks complement each other by evaluating several variants of LLaMA and
Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with
human preferences are publicly available at
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
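As a rough illustration of the setup the abstract describes, the sketch below shows pairwise LLM-as-a-judge scoring with one mitigation the paper discusses for position bias (judging each pair twice with the answer order swapped and only counting consistent wins), plus the agreement rate used to compare judge verdicts against human votes. The prompt template and the call_judge_model wrapper are illustrative assumptions, not the paper's exact prompts or the FastChat implementation.

```python
# Minimal sketch of pairwise LLM-as-a-judge with a position-swap check.
# call_judge_model() is a hypothetical stand-in for the judge LLM API.

JUDGE_TEMPLATE = (
    "You are an impartial judge. Given a user question and two assistant "
    "answers, decide which answer is better.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
    "Reply with exactly one token: A, B, or tie."
)


def call_judge_model(prompt: str) -> str:
    """Hypothetical stand-in for the judge LLM call (e.g. a GPT-4 chat request)."""
    raise NotImplementedError("plug in your own LLM client here")


def parse_verdict(text: str) -> str:
    """Very simple parsing; the prompt asks the judge for a single token."""
    tokens = text.strip().upper().split()
    first = tokens[0].strip(".,:") if tokens else ""
    return first if first in ("A", "B") else "tie"


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with the answer order swapped to reduce position bias."""
    v_orig = parse_verdict(call_judge_model(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_1, answer_b=answer_2)))
    v_swap = parse_verdict(call_judge_model(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_2, answer_b=answer_1)))
    # Map both verdicts back onto model_1 / model_2.
    as_orig = {"A": "model_1", "B": "model_2", "tie": "tie"}[v_orig]
    as_swap = {"A": "model_2", "B": "model_1", "tie": "tie"}[v_swap]
    # Only declare a winner when the two orderings agree; otherwise call a tie.
    return as_orig if as_orig == as_swap else "tie"


def agreement(judge_verdicts, human_verdicts):
    """Fraction of comparisons on which the LLM judge matches the human vote."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```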
Related papers
- WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs)
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
We study the performance of various large language models (LLMs) acting as judges.
We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs.
We find that both Llama-3 70B and GPT-4 Turbo show excellent alignment with humans, but in terms of ranking exam-taker models, they are outperformed by both JudgeLM-7B and the lexical judge "Contains".
arXiv Detail & Related papers (2024-06-18T13:49:54Z) - Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus [3.8436076642278754]
Leaderboards like Chatbot Arena rank Large Language Models (LLMs) based on how well their responses align with human preferences.
We propose a novel benchmarking framework, the Language Model Council (LMC).
The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
arXiv Detail & Related papers (2024-06-12T19:05:43Z) - Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference [48.99117537559644]
We introduce Chatbot Arena, an open platform for evaluating Large Language Models (LLMs) based on human preferences.
Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing.
This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using.
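An illustrative Elo-style rating sketch for such pairwise battles appears after the related-papers list below.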
arXiv Detail & Related papers (2024-03-07T01:22:38Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - JudgeLM: Fine-tuned Large Language Models are Scalable Judges [54.007823006976516]
We propose to fine-tune Large Language Models (LLMs) as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks.
We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges.
We then analyze the key biases in fine-tuning LLMs as judges, namely position bias, knowledge bias, and format bias.
arXiv Detail & Related papers (2023-10-26T17:48:58Z) - BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z) - SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark [16.802854803128433]
We propose SuperCLUE, a comprehensive Chinese benchmark named after another popular Chinese LLM benchmark, CLUE.
SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE).
Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones.
arXiv Detail & Related papers (2023-07-27T17:24:09Z)