ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- URL: http://arxiv.org/abs/2308.07201v1
- Date: Mon, 14 Aug 2023 15:13:04 GMT
- Title: ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- Authors: Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang
Zhang, Jie Fu, Zhiyuan Liu
- Abstract summary: We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
- Score: 57.71597869337909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text evaluation has historically posed significant challenges, often
demanding substantial labor and time costs. With the emergence of large language
models (LLMs), researchers have explored LLMs' potential as alternatives for
human evaluation. While these single-agent-based approaches show promise,
experimental results suggest that further advancements are needed to bridge the
gap between their current effectiveness and human-level evaluation quality.
Recognizing that best practices of human evaluation processes often involve
multiple human annotators collaborating in the evaluation, we resort to a
multi-agent debate framework, moving beyond single-agent prompting strategies.
The multi-agent-based approach enables a group of LLMs to synergize with an
array of intelligent counterparts, harnessing their distinct capabilities and
expertise to enhance efficiency and effectiveness in handling intricate tasks.
In this paper, we construct a multi-agent referee team called ChatEval to
autonomously discuss and evaluate the quality of generated responses from
different models on open-ended questions and traditional natural language
generation (NLG) tasks. Our analysis shows that ChatEval transcends mere
textual scoring, offering a human-mimicking evaluation process for reliable
assessments. Our code is available at https://github.com/chanchimin/ChatEval.
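The abstract describes ChatEval's core idea, several LLM referees discussing candidate responses before rendering verdicts, without implementation detail. Below is a minimal, hypothetical sketch of such a sequential debate loop; the agent callables, prompt wording, and two-round default are illustrative assumptions, not code taken from the ChatEval repository. In practice each agent would wrap a chat-model client carrying its own persona prompt; the dummy lambdas here only make the sketch runnable end to end.
```python
# Minimal multi-agent debate evaluator sketch (illustrative; not the ChatEval implementation).
# An "agent" is any callable mapping a prompt string to a reply string, so a real chat-model
# client can be plugged in; the dummy agents below just return canned text.
from typing import Callable, List

Agent = Callable[[str], str]

def debate_evaluate(question: str, response_a: str, response_b: str,
                    agents: List[Agent], rounds: int = 2) -> List[str]:
    """Run a simple sequential debate over two candidate responses and collect final verdicts."""
    history: List[str] = []
    for r in range(rounds):
        for i, agent in enumerate(agents):
            prompt = (
                f"You are reviewer #{i + 1}.\n"
                f"Question: {question}\n"
                f"Response A: {response_a}\nResponse B: {response_b}\n"
                "Discussion so far:\n" + "\n".join(history) +
                "\nGive your current judgement and reasoning."
            )
            # Each agent sees the shared discussion history before speaking.
            history.append(f"[round {r + 1}] reviewer #{i + 1}: {agent(prompt)}")
    # After the debate, ask every agent for a final verdict.
    verdicts = [agent("Based on the discussion, answer with 'A', 'B', or 'tie':\n"
                      + "\n".join(history)) for agent in agents]
    return verdicts

if __name__ == "__main__":
    # Dummy agents standing in for distinct LLM personas (e.g. "critic", "general public").
    dummy = [lambda p: "Response A is more factually grounded. Verdict: A",
             lambda p: "Response A answers the question more directly. Verdict: A"]
    print(debate_evaluate("Explain photosynthesis briefly.",
                          "Plants convert light into chemical energy.",
                          "Photosynthesis is when plants eat sunlight.",
                          dummy))
```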
Related papers
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z) - Evaluating the Performance of Large Language Models via Debates [43.40134389150456]
We propose an automated benchmarking framework based on debates between Large Language Models (LLMs).
This method assesses not only domain knowledge, but also skills such as problem definition and inconsistency recognition.
We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input.
arXiv Detail & Related papers (2024-06-16T19:02:31Z) - The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches [0.0]
We discuss issues with increasingly popular LLM-based evaluations and how they correlate with human evaluations.
We introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations.
Results show that factor-based evaluation produces better insights into which aspects of LLM applications need improvement.
arXiv Detail & Related papers (2024-06-05T14:55:10Z) - Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
arXiv Detail & Related papers (2024-05-23T14:48:15Z) - DEBATE: Devil's Advocate-Based Assessment and Text Evaluation [6.2689399557794525]
We propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system.
Within the framework, one agent is instructed to criticize other agents' arguments.
We show that the extensiveness of debates among agents and the persona of an agent can influence the performance of evaluators.
arXiv Detail & Related papers (2024-05-16T09:41:12Z) - Large Multimodal Agents: A Survey [78.81459893884737]
Large language models (LLMs) have achieved superior performance in powering text-based AI agents.
There is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain.
This review aims to provide valuable insights and guidelines for future research in this rapidly evolving field.
arXiv Detail & Related papers (2024-02-23T06:04:23Z) - Collaborative Evaluation: Exploring the Synergy of Large Language Models
and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a collaborative evaluation pipeline, CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)