A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
- URL: http://arxiv.org/abs/2312.15407v2
- Date: Sat, 20 Jan 2024 06:26:33 GMT
- Title: A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
- Authors: Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li
- Abstract summary: Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
- Score: 46.939611070781794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation is an integral aspect of dialogue system research. The
traditional reference-based NLG metrics are generally found to be unsuitable
for dialogue assessment. Consequently, recent studies have suggested various
unique, reference-free neural metrics that better align with human evaluations.
Notably among them, large language models (LLMs), particularly the
instruction-tuned variants like ChatGPT, are shown to be promising substitutes
for human judges. Yet, existing works on utilizing LLMs for automatic dialogue
evaluation are limited in their scope in terms of the number of meta-evaluation
datasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains
inconclusive how effective these LLMs are. To this end, we conduct a
comprehensive study on the application of LLMs for automatic dialogue
evaluation. Specifically, we analyze the multi-dimensional evaluation
capability of 30 recently emerged LLMs at both turn and dialogue levels, using
a comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the
robustness of the LLMs in handling various adversarial perturbations at both
turn and dialogue levels. Finally, we explore how model-level and
dimension-level ensembles impact the evaluation performance. All resources are
available at https://github.com/e0397123/comp-analysis.
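As a rough illustration of the turn-level protocol described in the abstract, the sketch below prompts an LLM judge for per-dimension ratings and meta-evaluates them against human annotations via Spearman correlation. The dimension names, prompt wording, and the call_llm callable are illustrative assumptions rather than the paper's exact setup; the actual prompts, models, and datasets are in the linked repository.
```python
# Minimal sketch of turn-level, multi-dimensional LLM scoring and Spearman-based
# meta-evaluation. Dimension names, prompt wording, and the `call_llm` callable
# are assumptions for illustration; see https://github.com/e0397123/comp-analysis
# for the actual prompts, models, and datasets used in the paper.
import re
from typing import Callable, Dict, List

from scipy.stats import spearmanr

DIMENSIONS = ["relevance", "naturalness", "engagingness"]  # assumed dimension set

PROMPT = (
    "Given the dialogue context and a candidate response, rate the response "
    "for {dimension} on a 1-5 scale. Reply with a single integer.\n\n"
    "Context:\n{context}\n\nResponse:\n{response}\n\nScore:"
)


def score_turn(call_llm: Callable[[str], str],
               context: str, response: str) -> Dict[str, float]:
    """Query the LLM judge once per dimension and parse the numeric rating."""
    scores = {}
    for dim in DIMENSIONS:
        reply = call_llm(PROMPT.format(dimension=dim, context=context, response=response))
        match = re.search(r"[1-5]", reply)  # tolerate chatty replies
        scores[dim] = float(match.group()) if match else 3.0  # midpoint fallback
    return scores


def meta_evaluate(llm_scores: List[float], human_scores: List[float]) -> float:
    """Spearman correlation between LLM and human ratings for one dimension."""
    rho, _ = spearmanr(llm_scores, human_scores)
    return rho


def model_ensemble(per_model_scores: List[List[float]]) -> List[float]:
    """Model-level ensemble: average the ratings of several LLM judges per sample."""
    return [sum(col) / len(col) for col in zip(*per_model_scores)]
```
Dialogue-level evaluation follows the same recipe with the full dialogue in place of a single context-response pair, and a dimension-level ensemble averages the per-dimension scores into an overall quality estimate.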
Related papers
- On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation [8.672875654352689]
Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks.
This paper critically examines current evaluation benchmarks, highlighting that reliance on older response generators and outdated quality aspects fails to accurately reflect the capabilities of modern chatbots.
arXiv Detail & Related papers (2024-07-04T11:14:47Z)
- Leveraging LLMs for Dialogue Quality Measurement [27.046917937460798]
Large language models (LLMs) show robust zero-shot and few-shot capabilities across NLP tasks.
By manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures.
Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
arXiv Detail & Related papers (2024-06-25T06:19:47Z)
- Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue [1.8652965834931452]
We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue.
We extensively analyze different LLM adaptation techniques when applied to different dialogue types.
arXiv Detail & Related papers (2024-06-10T15:52:49Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries in isolation, without considering the user experience.
We propose a new evaluation paradigm for MLLMs: evaluating them with per-sample criteria, using a capable MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation [7.767020408405403]
We propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting Large Language Models (LLMs).
Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks.
arXiv Detail & Related papers (2023-08-31T15:19:28Z)
- LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models [28.441725610692714]
We propose a unified multi-dimensional automatic evaluation method for open-domain conversations with large language models (LLMs).
We design a single prompt-based evaluation method that leverages a unified evaluation schema to cover multiple dimensions of conversation quality in a single model call; a rough sketch of this single-call idea appears after this list.
We extensively evaluate the performance of LLM-Eval on various benchmark datasets, demonstrating its effectiveness, efficiency, and adaptability compared to state-of-the-art evaluation methods.
arXiv Detail & Related papers (2023-05-23T05:57:09Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
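To make LLM-Eval's single-call idea above concrete, here is a minimal sketch that requests all dimension scores in one JSON reply. The schema, dimension names, and the call_llm callable are illustrative assumptions, not the paper's exact prompt.
```python
# Rough sketch of a single-call, multi-dimensional evaluation prompt in the spirit
# of LLM-Eval. The JSON schema, dimension names, and the `call_llm` callable are
# illustrative assumptions, not the paper's exact prompt.
import json
from typing import Callable, Dict

SCHEMA = {"appropriateness": "1-5", "content": "1-5", "grammar": "1-5", "relevance": "1-5"}


def build_prompt(context: str, response: str) -> str:
    """Ask for every dimension at once and request a JSON-only answer."""
    return (
        "Score the response to the dialogue context on each dimension below "
        "(integers in the given ranges). Answer with a single JSON object and "
        "nothing else.\nDimensions: " + json.dumps(SCHEMA) + "\n\n"
        "Context:\n" + context + "\n\nResponse:\n" + response + "\n\nJSON:"
    )


def evaluate_single_call(call_llm: Callable[[str], str],
                         context: str, response: str) -> Dict[str, int]:
    """One model call returns the scores for all dimensions."""
    reply = call_llm(build_prompt(context, response))
    start, end = reply.find("{"), reply.rfind("}") + 1  # tolerate text around the JSON
    return {dim: int(score) for dim, score in json.loads(reply[start:end]).items()}
```
Compared with the one-call-per-dimension loop sketched after the abstract, this trades some parsing robustness for a single, cheaper model call.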