Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
- URL: http://arxiv.org/abs/2403.04132v1
- Date: Thu, 7 Mar 2024 01:22:38 GMT
- Title: Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
- Authors: Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas
Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan,
Joseph E. Gonzalez, Ion Stoica
- Abstract summary: We introduce Chatbot Arena, an open platform for evaluating Large Language Models (LLMs) based on human preferences.
Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing.
This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using.
- Score: 48.99117537559644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have unlocked new capabilities and applications;
however, evaluating the alignment with human preferences still poses
significant challenges. To address this issue, we introduce Chatbot Arena, an
open platform for evaluating LLMs based on human preferences. Our methodology
employs a pairwise comparison approach and leverages input from a diverse user
base through crowdsourcing. The platform has been operational for several
months, amassing over 240K votes. This paper describes the platform, analyzes
the data we have collected so far, and explains the tried-and-true statistical
methods we are using for efficient and accurate evaluation and ranking of
models. We confirm that the crowdsourced questions are sufficiently diverse and
discriminating and that the crowdsourced human votes are in good agreement with
those of expert raters. These analyses collectively establish a robust
foundation for the credibility of Chatbot Arena. Because of its unique value
and openness, Chatbot Arena has emerged as one of the most referenced LLM
leaderboards, widely cited by leading LLM developers and companies. Our demo is
publicly available at https://chat.lmsys.org.
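The pairwise comparison approach described in the abstract can be illustrated with a small ranking sketch. The abstract does not name the statistical model, so the Bradley-Terry model used here is an assumption chosen because it is a standard way to turn pairwise votes into a ranking. The vote data, the model names "A", "B", "C", and the function name `bradley_terry` are all hypothetical, and ties are ignored for simplicity.

```python
from collections import defaultdict


def bradley_terry(votes, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization update. This is an
    illustrative sketch, not the platform's actual implementation.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)   # model -> number of wins
    games = defaultdict(int)  # unordered pair -> number of comparisons
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                n = games.get(frozenset((i, j)), 0) if j != i else 0
                if n:
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so the strengths sum to the number of models.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


# Hypothetical votes: each tuple is (winner, loser) for one comparison.
votes = (
    [("A", "B")] * 3 + [("B", "A")] * 1
    + [("B", "C")] * 2 + [("C", "B")] * 1
    + [("A", "C")] * 2 + [("C", "A")] * 1
)
strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under this model, the estimated probability that model i is preferred over model j is strength[i] / (strength[i] + strength[j]), so only the ratios between strengths matter.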
Related papers
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback [54.10302745921713]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Large Language Models' Detection of Political Orientation in Newspapers [0.0]
Various methods have been developed to better understand newspapers' positioning.
The advent of Large Language Models (LLMs) holds disruptive potential to assist researchers and citizens alike.
We compare how four widely employed LLMs rate the positioning of newspapers, and whether their answers align with one another.
Over a worldwide dataset, articles in newspapers are positioned strikingly differently by single LLMs, hinting at inconsistent training or excessive randomness in the algorithms.
arXiv Detail & Related papers (2024-05-23T06:18:03Z) - Evaluating Large Language Models with Human Feedback: Establishing a Swedish Benchmark [0.0]
Large language models (LLMs) have demonstrated significant capabilities across numerous applications.
This study introduces a comprehensive human benchmark to assess the efficacy of prominent LLMs in understanding and generating Swedish language texts.
arXiv Detail & Related papers (2024-05-22T21:22:51Z) - Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
arXiv Detail & Related papers (2024-02-26T18:00:49Z) - UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z) - PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z) - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z) - Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat.
Our objective is to capture the breadth of interactions that a human might have with an AI assistant.
We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z)