Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus
- URL: http://arxiv.org/abs/2406.08598v1
- Date: Wed, 12 Jun 2024 19:05:43 GMT
- Title: Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus
- Authors: Justin Zhao, Flor Miriam Plaza-del-Arco, Amanda Cercas Curry
- Abstract summary: Leaderboards like Chatbot Arena rank Large Language Models (LLMs) based on how well their responses align with human preferences.
We propose a novel benchmarking framework, the Language Model Council (LMC).
The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury.
- Score: 3.8436076642278754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. Leaderboards like Chatbot Arena rank LLMs based on how well their responses align with human preferences. However, many tasks such as those related to emotional intelligence, creative writing, or persuasiveness, are highly subjective and often lack majoritarian human agreement. Judges may have irreconcilable disagreements about what constitutes a better response. To address the challenge of ranking LLMs on highly subjective tasks, we propose a novel benchmarking framework, the Language Model Council (LMC). The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury. We deploy a council of 20 newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, robust, and less biased than those from any individual LLM judge, and is more consistent with a human-established leaderboard compared to other benchmarks.
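The three-stage council process described in the abstract can be illustrated with a short sketch. This is a minimal illustration only, not the authors' implementation: the `ask(model, prompt)` helper, the single seed topic, and the simple win-count aggregation are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def run_council(members, ask, seed_topic="interpersonal dilemmas"):
    """Rank council members by mutual evaluation; ask(model, prompt) -> str is assumed."""
    # 1) Formulate the test set with equal participation: each member contributes one item.
    test_set = [ask(m, f"Write one challenging question about {seed_topic}.") for m in members]

    # 2) Administer the test: every member answers every item.
    answers = {(m, i): ask(m, q) for m in members for i, q in enumerate(test_set)}

    # 3) Evaluate as a collective jury: every member judges every anonymized pair of
    #    responses, and wins are tallied into a simple aggregate ranking.
    wins = defaultdict(int)
    for i, q in enumerate(test_set):
        for a, b in combinations(members, 2):
            for judge in members:
                prompt = (f"Question: {q}\n"
                          f"Response 1: {answers[(a, i)]}\n"
                          f"Response 2: {answers[(b, i)]}\n"
                          "Which response is better? Answer with 1 or 2.")
                verdict = ask(judge, prompt)
                wins[a if verdict.strip().startswith("1") else b] += 1
    return sorted(members, key=lambda m: wins[m], reverse=True)
```

The paper aggregates jury votes into a full ranking; the win-count tally above simply stands in for whatever aggregation scheme is used.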
Related papers
- On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, in which two AIs compete to convince a judge, and consultancy, in which a single AI tries to convince a judge that asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
We study the performance of various large language models (LLMs) acting as judges.
We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs.
We find that both Llama-3 70B and GPT-4 Turbo show excellent alignment with human judgments, but in ranking exam-taker models they are outperformed by both JudgeLM-7B and the lexical judge Contains.
arXiv Detail & Related papers (2024-06-18T13:49:54Z)
- Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions [77.83767077859835]
We propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents.
In our experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
- Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition [46.949604465227054]
We propose a sample-efficient human evaluation method based on MAximum Discrepancy (MAD) competition.
MAD automatically selects a small set of informative and diverse instructions, each adapted to two LLMs.
The pairwise comparison results are then aggregated into a global ranking using the Elo rating system (a generic Elo update sketch appears after this list).
arXiv Detail & Related papers (2024-04-10T01:26:24Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
- Can ChatGPT Assess Human Personalities? A General Evaluation Framework [70.90142717649785]
Large Language Models (LLMs) have produced impressive results in various areas, but their potential human-like psychology is still largely unexplored.
This paper presents a generic evaluation framework for LLMs to assess human personalities based on Myers-Briggs Type Indicator (MBTI) tests.
arXiv Detail & Related papers (2023-03-01T06:16:14Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
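For reference, the Elo aggregation mentioned in the Maximum Discrepancy (MAD) entry above follows the standard Elo rating update. The sketch below is the generic textbook rule, not code from that paper; the K-factor of 32 and base rating of 1000 are conventional defaults chosen for illustration.

```python
def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    # outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome - e_a), r_b + k * ((1 - outcome) - (1 - e_a))

def rank_from_comparisons(models, comparisons, base=1000):
    # comparisons: iterable of (model_a, model_b, outcome) tuples.
    ratings = {m: float(base) for m in models}
    for a, b, outcome in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Because Elo updates are order-dependent, leaderboards that use this scheme typically average over many permutations of the comparison sequence or fit a Bradley-Terry model instead.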
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.