Overview of Robust and Multilingual Automatic Evaluation Metrics for
Open-Domain Dialogue Systems at DSTC 11 Track 4
- URL: http://arxiv.org/abs/2306.12794v3
- Date: Thu, 14 Sep 2023 01:33:36 GMT
- Title: Overview of Robust and Multilingual Automatic Evaluation Metrics for
Open-Domain Dialogue Systems at DSTC 11 Track 4
- Authors: Mario Rodríguez-Cantelar and Chen Zhang and Chengguang Tang and Ke
Shi and Sarik Ghazarian and João Sedoc and Luis Fernando D'Haro and
Alexander Rudnicky
- Abstract summary: This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics.
This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks.
- Score: 51.142614461563184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent and fast development of neural networks have revolutionized the
research on dialogue systems and subsequently have triggered various challenges
regarding their automatic evaluation. The automatic evaluation of open-domain
dialogue systems remains an open challenge and has been the center of attention
for many researchers. Despite consistent efforts to improve automatic metrics'
correlations with human evaluation, there have been very few attempts to assess
their robustness over multiple domains and dimensions. Moreover, most existing
metrics focus mainly on the English language. All of these challenges prompt the development
of automatic evaluation metrics that are reliable in various domains,
dimensions, and languages. This track in the 11th Dialogue System Technology
Challenge (DSTC11) is part of the ongoing effort to promote robust and
multilingual automatic evaluation metrics. This article describes the datasets
and baselines provided to participants and discusses the submission and result
details of the two proposed subtasks.
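As a concrete illustration of how the correlation between automatic metrics and human evaluation is usually measured in this setting, the sketch below computes the Spearman correlation between metric scores and human ratings for each evaluation dimension and averages the results. This is a minimal sketch only: the dimension names, the score values, and the mean_spearman helper are hypothetical and are not taken from the DSTC11 track data or baselines.

```python
# Minimal sketch: correlate automatic metric scores with human ratings per
# evaluation dimension, then average. All values below are made up for
# illustration; they are NOT DSTC11 track data.
from scipy.stats import spearmanr

human_ratings = {
    "appropriateness": [4.0, 2.5, 5.0, 3.0, 1.5],
    "relevance":       [3.5, 2.0, 4.5, 3.0, 2.0],
}
metric_scores = {
    "appropriateness": [0.81, 0.42, 0.93, 0.55, 0.30],
    "relevance":       [0.70, 0.38, 0.88, 0.60, 0.35],
}

def mean_spearman(human: dict, metric: dict) -> float:
    """Average Spearman correlation across evaluation dimensions."""
    rhos = [spearmanr(human[dim], metric[dim])[0] for dim in human]
    return sum(rhos) / len(rhos)

print(f"Mean Spearman correlation: {mean_spearman(human_ratings, metric_scores):.3f}")
```

In multilingual and robustness subtasks, the same computation is typically repeated per language and per perturbed test set so that a single metric can be compared across conditions.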
Related papers
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization [7.234196390284036]
This article summarizes the research on Transformer-based abstractive summarization for English dialogues.
We cover the main challenges present in dialogue summarization (i.e., language, structure, comprehension, speaker, salience, and factuality).
We find that while some challenges, like language, have seen considerable progress, others, such as comprehension, factuality, and salience, remain difficult and hold significant research opportunities.
arXiv Detail & Related papers (2024-06-11T17:30:22Z)
- Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation [50.60733773088296]
We conduct a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023).
We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context.
Our analysis revealed that: 1) the proposed evaluation strategy is robust and its scores correlate well with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF.
arXiv Detail & Related papers (2024-06-06T09:18:42Z)
- DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
- Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation [7.767020408405403]
We propose a novel framework that takes advantage of the strengths of current evaluation models together with the newly established paradigm of prompting Large Language Models (LLMs).
Empirical results show that our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks.
arXiv Detail & Related papers (2023-08-31T15:19:28Z)
- Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information [18.859159491548006]
We propose a novel learning-based automatic evaluation metric (CMN) for open-domain dialogues.
We employ Conditional Variational Autoencoders (CVAEs) with a Next Sentence Prediction (NSP) objective and use Mutual Information (MI) to model the semantic similarity of text in the latent space.
Experimental results on two open-domain dialogue datasets demonstrate the superiority of our method compared with a wide range of baselines.
arXiv Detail & Related papers (2023-05-26T14:21:54Z)
- Automatic Evaluation and Moderation of Open-domain Dialogue Systems [59.305712262126264]
A long-standing challenge for researchers is the lack of effective automatic evaluation metrics.
This paper describes the data, baselines, and results obtained for Track 5 of the 10th Dialogue System Technology Challenge (DSTC10).
arXiv Detail & Related papers (2021-11-03T10:08:05Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
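To illustrate the general idea behind unreferenced, embedding-based dialogue metrics like the one summarized in the last entry above, the sketch below scores a response by the cosine similarity between pre-trained sentence embeddings of the dialogue context and of the response, with no reference response required. It assumes the sentence-transformers library is available; the chosen model name and the unreferenced_score helper are illustrative assumptions and are not the method of the cited paper.

```python
# Minimal sketch of an unreferenced, embedding-based dialogue metric:
# score a candidate response by its semantic similarity to the dialogue
# context. Illustrative only; not the model proposed in the cited paper.
from sentence_transformers import SentenceTransformer, util

# Any pre-trained sentence encoder could be used; this model is an
# assumption made for the example.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def unreferenced_score(context: str, response: str) -> float:
    """Cosine similarity between context and response embeddings."""
    ctx_emb, resp_emb = encoder.encode([context, response], convert_to_tensor=True)
    return util.cos_sim(ctx_emb, resp_emb).item()

score = unreferenced_score(
    "I just got back from a trip to Japan.",
    "That sounds great! Which cities did you visit?",
)
print(f"Unreferenced coherence score: {score:.3f}")
```

A raw similarity score like this is usually calibrated or combined with other signals (e.g., fluency or specificity estimates) before being correlated with human annotations.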