Simple LLM Prompting is State-of-the-Art for Robust and Multilingual
Dialogue Evaluation
- URL: http://arxiv.org/abs/2308.16797v2
- Date: Fri, 8 Sep 2023 11:24:06 GMT
- Title: Simple LLM Prompting is State-of-the-Art for Robust and Multilingual
Dialogue Evaluation
- Authors: John Mendonça, Patrícia Pereira, Helena Moniz, João Paulo Carvalho, Alon Lavie, Isabel Trancoso
- Abstract summary: We propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting Large Language Models (LLMs).
Empirical results show our framework achieves state-of-the-art mean Spearman correlation scores across several benchmarks.
- Score: 7.767020408405403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant research effort in the development of automatic dialogue
evaluation metrics, little thought is given to evaluating dialogues other than
in English. At the same time, ensuring metrics are invariant to semantically
similar responses is also an overlooked topic. In order to achieve the desired
properties of robustness and multilinguality for dialogue evaluation metrics,
we propose a novel framework that combines the strengths of current evaluation
models with the newly established paradigm of prompting Large Language Models
(LLMs). Empirical results show our framework achieves state-of-the-art mean
Spearman correlation scores across several benchmarks and ranks first on both
the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation
Metrics for Open-Domain Dialogue Systems", demonstrating the evaluation
capabilities of prompted LLMs.
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building on ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation [23.203761925540736]
We propose SLIDE (Small and Large Integrated for Dialogue Evaluation), a novel framework.
Our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and SLIDE additionally exhibits better correlation with human evaluators.
arXiv Detail & Related papers (2024-05-24T20:32:49Z)
- Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation [26.330012489735456]
This paper proposes an effective framework for open-domain dialogue evaluation.
It combines domain-specific language models (SLMs), enhanced with Abstract Meaning Representation (AMR) knowledge, with Large Language Models (LLMs).
Experimental results on open-domain dialogue evaluation tasks demonstrate the superiority of our method compared to a wide range of state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-01T14:11:45Z)
- A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators [46.939611070781794]
Large language models (LLMs) are shown to be promising substitutes for human judges.
We analyze the multi-dimensional evaluation capability of 30 recently emerged LLMs at both turn and dialogue levels.
We also probe the robustness of the LLMs in handling various adversarial perturbations at both turn and dialogue levels.
arXiv Detail & Related papers (2023-12-24T04:50:57Z)
- Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thoughts (Cue-CoT) to provide a more personalized and engaging response.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets (a minimal two-step prompting sketch appears after this list).
arXiv Detail & Related papers (2023-05-19T16:27:43Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods such as BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z)
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue [15.31433922183745]
We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 Track 5 Subtask 1 "Automatic Open-domain Dialogue Evaluation" challenge.
arXiv Detail & Related papers (2022-06-19T13:43:59Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
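For the "Learning an Unreferenced Metric for Online Dialogue Evaluation" entry above, the sketch below illustrates the core idea only loosely: score a candidate response against its dialogue context in the latent space of a pre-trained language model, with no ground-truth reference response needed at inference time. The original work trains a dedicated scoring model; using an off-the-shelf sentence encoder and cosine similarity here is an assumption made purely for illustration.

```python
# A crude stand-in for an unreferenced metric: embed context and response with a
# pre-trained encoder and use their latent-space similarity as the score, so no
# reference response is required at inference time.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained sentence encoder

def unreferenced_score(context: str, response: str) -> float:
    """Latent-space relatedness of a candidate response to its dialogue context."""
    ctx_emb = encoder.encode(context, convert_to_tensor=True)
    rsp_emb = encoder.encode(response, convert_to_tensor=True)
    return util.cos_sim(ctx_emb, rsp_emb).item()

print(unreferenced_score("How was your trip to Lisbon?", "It was wonderful, the food was amazing."))
```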
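For the Cue-CoT entry above (as flagged there), here is one plausible two-step reading of linguistic cue-based chain-of-thought prompting: first ask the LLM to infer the user's status from cues in the dialogue, then generate the reply conditioned on that intermediate analysis. The prompt wording and the `llm` callable are assumptions, not the paper's exact prompts.

```python
# Minimal two-step prompting sketch in the spirit of Cue-CoT: infer the user's
# status from linguistic cues, then condition the final response on that analysis.
from typing import Callable

def cue_cot_respond(llm: Callable[[str], str], dialogue: str) -> str:
    # Step 1: infer cues (e.g., the user's emotional state and underlying needs).
    cues = llm(
        "Read the dialogue below and briefly describe the user's emotional state "
        f"and underlying needs, based on linguistic cues.\n\nDialogue:\n{dialogue}"
    )
    # Step 2: generate a response conditioned on the inferred cues.
    return llm(
        "Using the analysis of the user's status below, write a personalized and "
        f"engaging reply to the last user turn.\n\nAnalysis:\n{cues}\n\nDialogue:\n{dialogue}\n\nReply:"
    )
```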