Collaborative Evaluation: Exploring the Synergy of Large Language Models
and Humans for Open-ended Generation Evaluation
- URL: http://arxiv.org/abs/2310.19740v1
- Date: Mon, 30 Oct 2023 17:04:35 GMT
- Title: Collaborative Evaluation: Exploring the Synergy of Large Language Models
and Humans for Open-ended Generation Evaluation
- Authors: Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi
- Abstract summary: Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline, CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
- Score: 71.76872586182981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are widely involved in the evaluation of open-ended natural
language generation (NLG) tasks that demand creativity, because automatic
metrics often exhibit weak correlations with human judgments. Large language
models (LLMs) have recently emerged as a scalable and cost-effective
alternative to human evaluation. However, both humans and LLMs have
limitations, namely inherent subjectivity and unreliable judgments,
particularly for open-ended tasks that require adaptable metrics tailored to
diverse task requirements. To explore the synergy between humans and LLM-based
evaluators, and to address the challenge of inconsistent evaluation criteria
in open-ended NLG tasks, we propose CoEval, a collaborative evaluation
pipeline covering the design of a checklist of task-specific criteria and the
detailed evaluation of texts: the LLM generates the initial ideation, and
humans then engage in scrutiny. We conducted a series of experiments to
investigate the mutual effects between LLMs and humans in CoEval. Results show
that, by utilizing LLMs, CoEval evaluates lengthy texts effectively, saving
significant time and reducing human evaluation outliers. Human scrutiny still
plays a role, revising around 20% of LLM evaluation scores to ensure
reliability.
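The abstract describes CoEval as a two-stage loop: the LLM first drafts task-specific criteria and initial per-criterion scores, and humans then scrutinize and, where needed, revise them. The following Python sketch is a hypothetical illustration of that loop; the prompt wording, the 1-5 scale, and the `llm` and `human_review` callables are assumptions for exposition, not the authors' released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Evaluation:
    criteria: List[str]
    llm_scores: Dict[str, float]
    final_scores: Dict[str, float] = field(default_factory=dict)

def coeval(task_description: str, text: str,
           llm: Callable[[str], str],
           human_review: Callable[[str, float], float]) -> Evaluation:
    # Stage 1: LLM ideation -- draft a checklist of task-specific criteria.
    criteria_prompt = (
        f"Task: {task_description}\n"
        "List 5 concise evaluation criteria for this task, one per line."
    )
    criteria = [line.strip("- ").strip()
                for line in llm(criteria_prompt).splitlines() if line.strip()]

    # Stage 2: the LLM scores the text against each criterion (1-5 scale assumed).
    llm_scores = {}
    for criterion in criteria:
        score_prompt = (
            f"Criterion: {criterion}\nText:\n{text}\n"
            "Rate the text on this criterion from 1 to 5. Reply with a number only."
        )
        llm_scores[criterion] = float(llm(score_prompt).strip())

    # Stage 3: human scrutiny -- reviewers keep or revise each LLM score.
    evaluation = Evaluation(criteria=criteria, llm_scores=llm_scores)
    for criterion, score in llm_scores.items():
        evaluation.final_scores[criterion] = human_review(criterion, score)
    return evaluation
```

In a realistic workflow, `human_review` would typically return the LLM's score unchanged unless an annotator flags it, consistent with the paper's observation that only around 20% of scores end up revised.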
Related papers
- Optimizing the role of human evaluation in LLM-based spoken document summarization systems [0.0]
We propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content.
We provide detailed evaluation criteria and best-practice guidelines to ensure robustness of the experimental design, replicability, and trustworthiness of human evaluations.
arXiv Detail & Related papers (2024-10-23T18:37:14Z)
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (a toy correlation check is sketched after this list).
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
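The G-Eval entry above reports agreement with humans via Spearman correlation, which is the standard meta-evaluation for LLM judges. A toy check of this kind, assuming SciPy is available and using made-up placeholder scores, might look as follows:

```python
from scipy.stats import spearmanr

# Toy meta-evaluation: how well do an LLM judge's scores agree with human
# ratings over the same outputs? These numbers are made-up placeholders.
llm_scores = [4.0, 2.5, 3.0, 5.0, 1.5, 4.5]
human_scores = [3.5, 2.0, 3.5, 4.5, 1.0, 4.0]

rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```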