Related papers: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

URL: http://arxiv.org/abs/2306.05685v4
Date: Sun, 24 Dec 2023 02:01:34 GMT
Title: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
Abstract summary: We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
Score: 76.21004582932268
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

Related papers

Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater.<n>We develop a no-consensus'' benchmark by curating examples that encompass a variety of a priori ambivalent scenarios.<n>Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding [3.1674468740431396]
We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We show that providing a weaker judge with higher quality references reaches better agreement with human annotators than a stronger judge.
arXiv Detail & Related papers (2025-03-07T00:42:08Z)
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge [32.55871325700294]
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP) Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm.
arXiv Detail & Related papers (2024-11-25T17:28:44Z)
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems [43.19298196163617]
We present a simple efficient technique to combine the best of both worlds.<n>We develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18 diverse languages on Wikipedia.
arXiv Detail & Related papers (2024-10-17T16:18:49Z)
JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding. Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking [56.275521022148794]
Post-training methods claim superior alignment by virtue of better correspondence with human pairwise preferences. We attempt to answer the question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM-judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning stage of post-training, and not the PO stage, has the greatest impact on alignment.
arXiv Detail & Related papers (2024-09-23T17:58:07Z)
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents. In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference [48.99117537559644]
We introduce Arena, an open platform for evaluating Large Language Models (LLMs) based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using.
arXiv Detail & Related papers (2024-03-07T01:22:38Z)
JudgeLM: Fine-tuned Large Language Models are Scalable Judges [54.007823006976516]
We propose to fine-tune Large Language Models (LLMs) as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias.
arXiv Detail & Related papers (2023-10-26T17:48:58Z)
SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark [16.802854803128433]
We propose a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE) Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones.
arXiv Detail & Related papers (2023-07-27T17:24:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.