FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
- URL: http://arxiv.org/abs/2210.13832v1
- Date: Tue, 25 Oct 2022 08:26:03 GMT
- Title: FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
- Authors: Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs,
Haizhou Li
- Abstract summary: We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
- Score: 58.46761798403072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent model-based reference-free metrics for open-domain dialogue evaluation
exhibit promising correlations with human judgment. However, they either
perform turn-level evaluation or look at a single dialogue quality dimension.
One would expect a good evaluation metric to assess multiple quality dimensions
at the dialogue level. To this end, we are motivated to propose a
multi-dimensional dialogue-level metric, which consists of three sub-metrics
with each targeting a specific dimension. The sub-metrics are trained with
novel self-supervised objectives and exhibit strong correlations with human
judgment for their respective dimensions. Moreover, we explore two approaches
to combine the sub-metrics: metric ensemble and multitask learning. Both
approaches yield a holistic metric that significantly outperforms individual
sub-metrics. Compared to the existing state-of-the-art metric, the combined
metrics achieve around 16% relative improvement on average across three
high-quality dialogue-level evaluation benchmarks.
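The abstract mentions two ways of combining the dimension-specific sub-metrics: a metric ensemble and multitask learning. Below is a minimal, illustrative sketch of the ensemble variant only, assuming each sub-metric maps a full dialogue to a normalized score in [0, 1]; the function names and the three dimension labels are placeholders and not the authors' implementation.
```python
# Minimal sketch (not the authors' code) of a metric ensemble: each
# dimension-specific sub-metric scores the whole dialogue, and the holistic
# score is the average of the per-dimension scores.
from typing import Callable, Dict, List

# A sub-metric maps a dialogue (list of utterances) to a score in [0, 1].
SubMetric = Callable[[List[str]], float]

def ensemble_score(dialogue: List[str], sub_metrics: Dict[str, SubMetric]) -> float:
    """Combine dimension-specific scores into one dialogue-level score."""
    per_dimension = {name: metric(dialogue) for name, metric in sub_metrics.items()}
    return sum(per_dimension.values()) / len(per_dimension)

# Placeholder sub-metrics; real ones would be models trained with the paper's
# self-supervised objectives. The dimension names are illustrative labels only.
if __name__ == "__main__":
    dummy: SubMetric = lambda dialogue: 0.5
    metrics = {"coherence": dummy, "likeability": dummy, "topic_depth": dummy}
    print(ensemble_score(["Hi there!", "Hello, how was your day?"], metrics))
```
The multitask-learning alternative described in the abstract would instead train a single model with one scoring head per dimension; that variant is not sketched here.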
Related papers
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z)
- MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue [15.31433922183745]
We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 track5 subtask1 Automatic Open-domain Dialogue Evaluation Challenge.
arXiv Detail & Related papers (2022-06-19T13:43:59Z)
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation [66.60285024216573]
A dialogue evaluator is also expected to conduct assessments across domains.
Most of the state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation.
We are motivated to design a general and robust framework, MDD-Eval, to address the problem.
arXiv Detail & Related papers (2021-12-14T07:01:20Z)
- A Comprehensive Assessment of Dialog Evaluation Metrics [9.34612743192798]
Standard language evaluation metrics are ineffective for evaluating dialog.
Recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgments.
This paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets.
arXiv Detail & Related papers (2021-06-07T15:17:03Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It not only performs turn-level evaluation but also holistically considers the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics (a minimal sketch of the distribution-distance idea appears after this list).
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071]
Various evaluation protocols are currently in use to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is most effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
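As a companion to the "Assessing Dialogue Systems with Distribution Distances" entry above, here is a minimal sketch of one way a distribution-wise distance between generated and real conversations can be computed: fit a Gaussian to conversation embeddings from each side and take the Fréchet distance between the two Gaussians. The embeddings here are random placeholders, and the cited paper's exact distance measures may differ.
```python
# Illustrative Frechet-style distance between the distribution of system-generated
# conversation embeddings and real conversation embeddings. A real setup would
# encode each conversation with a trained encoder; random vectors stand in here.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two embedding sets (rows = samples)."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_g).real  # matrix square root of the covariance product
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * cov_mean))

# Usage with placeholder embeddings: a shifted distribution yields a larger distance.
real = np.random.randn(500, 32)
generated = np.random.randn(500, 32) + 0.3
print(frechet_distance(real, generated))
```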