MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for
Evaluating Open-Domain Dialogue
- URL: http://arxiv.org/abs/2206.09403v1
- Date: Sun, 19 Jun 2022 13:43:59 GMT
- Title: MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for
Evaluating Open-Domain Dialogue
- Authors: Pengfei Zhang, Xiaohui Hu, Kaidong Yu, Jian Wang, Song Han, Cao Liu,
Chunyang Yuan
- Abstract summary: We propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for evaluating open-domain dialogue.
MME-CRS ranks first by a large margin on the final test data of the DSTC10 track5 subtask1 Automatic Open-domain Dialogue Evaluation Challenge.
- Score: 15.31433922183745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic open-domain dialogue evaluation is a crucial component of dialogue
systems. Recently, learning-based evaluation metrics have achieved
state-of-the-art performance in open-domain dialogue evaluation. However, these
metrics focus on only a few qualities, making it hard to evaluate dialogue
comprehensively. Furthermore, these metrics lack an effective score composition
approach for diverse evaluation qualities. To address the above problems, we
propose a Multi-Metric Evaluation based on Correlation Re-Scaling (MME-CRS) for
evaluating open-domain dialogue. Firstly, we build an evaluation metric
composed of 5 groups of parallel sub-metrics called Multi-Metric Evaluation
(MME) to evaluate the quality of dialogue comprehensively. Furthermore, we
propose a novel score composition method called Correlation Re-Scaling (CRS) to
model the relationship between sub-metrics and diverse qualities. Our approach
MME-CRS ranks first by a large margin on the final test data of the DSTC10
track5 subtask1 Automatic Open-domain Dialogue Evaluation Challenge, which
demonstrates the effectiveness of our proposed approach.
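The CRS composition step can be illustrated with a minimal sketch. The abstract does not spell out the exact weighting scheme, so the power re-scaling below (raising each sub-metric's correlation with human judgment to a power before normalizing it into a weight) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def crs_weights(correlations, power=2.0):
    """Re-scale sub-metric/human correlations into composition weights.

    `power` sharpens the contrast between strongly and weakly correlated
    sub-metrics (illustrative choice, not taken from the paper).
    """
    c = np.clip(np.asarray(correlations, dtype=float), 0.0, None)
    scaled = c ** power
    return scaled / scaled.sum()

def compose_score(sub_scores, weights):
    """Weighted sum of parallel sub-metric scores for one response."""
    return float(np.dot(sub_scores, weights))

# Five sub-metric groups with hypothetical correlations to human judgment.
w = crs_weights([0.4, 0.1, 0.3, 0.2, 0.5])
score = compose_score([0.8, 0.6, 0.7, 0.9, 0.75], w)
```

Sub-metrics that track human judgment more closely receive larger weights, so the composed score leans on the most reliable signals for each quality.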
Related papers
- ComperDial: Commonsense Persona-grounded Dialogue Dataset and Benchmark [26.100299485985197]
ComperDial consists of human-scored responses for 10,395 dialogue turns in 1,485 conversations collected from 99 dialogue agents.
In addition to single-turn response scores, ComperDial also contains dialogue-level human-annotated scores.
Building off ComperDial, we devise a new automatic evaluation metric to measure the general similarity of model-generated dialogues to human conversations.
arXiv Detail & Related papers (2024-06-17T05:51:04Z)
- PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison [38.03304773600225]
PairEval is a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations.
We show that PairEval exhibits a higher correlation with human judgments than baseline metrics.
We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems.
arXiv Detail & Related papers (2024-04-01T09:35:06Z)
- DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment [38.26039323208791]
We release a large-scale dialogue quality assessment dataset (DiQAD) for automatically assessing open-domain dialogue quality.
Specifically, we establish the assessment criteria based on the dimensions conforming to human judgements on dialogue qualities.
We also annotate, based on these criteria, large-scale dialogues conducted between real users, yielding around 100,000 dialogues.
arXiv Detail & Related papers (2023-10-25T03:04:57Z)
- Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation [7.767020408405403]
We propose a novel framework that takes advantage of the strengths of current evaluation models together with the newly-established paradigm of prompting Large Language Models (LLMs).
Empirical results show our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks.
arXiv Detail & Related papers (2023-08-31T15:19:28Z)
- C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation [68.59356746305255]
We propose a novel model-agnostic approach to measure the turn-level interaction between the system and the user.
Our approach significantly improves the correlation with human judgment compared with existing evaluation systems.
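The summary does not give the C-PMI formulation, but the underlying quantity is pointwise mutual information. A count-based toy estimate of PMI between user and system turn types (ignoring the conditioning on dialogue context that gives C-PMI its name, and using hypothetical turn labels) can be sketched as:

```python
import math
from collections import Counter

def pmi(pairs, x, y):
    """Pointwise mutual information log p(x, y) / (p(x) p(y)),
    estimated from co-occurrence counts over (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return math.log((joint[(x, y)] / n) / ((px[x] / n) * (py[y] / n)))

# Toy "dialogue turns": user turn type paired with system reply type.
turns = [("greet", "greet"), ("greet", "greet"),
         ("ask_time", "tell_time"), ("ask_time", "tell_time"),
         ("greet", "tell_time")]
interaction = pmi(turns, "ask_time", "tell_time")
```

A positive PMI indicates the system's reply type co-occurs with the user's turn more often than chance, which is the kind of turn-level interaction signal the paper measures with learned models rather than counts.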
arXiv Detail & Related papers (2023-06-27T06:58:03Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics with each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows [63.116280145770006]
We propose segment act, an extension of dialog act from utterance level to segment level, and crowdsource a large-scale dataset for it.
To utilize segment act flows, sequences of segment acts, for evaluation, we develop the first consensus-based dialogue evaluation framework, FlowEval.
arXiv Detail & Related papers (2022-02-14T11:37:20Z)
- MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation [66.60285024216573]
A dialogue evaluator is expected to conduct assessment across domains as well.
Most of the state-of-the-art automatic dialogue evaluation metrics (ADMs) are not designed for multi-domain evaluation.
We are motivated to design a general and robust framework, MDD-Eval, to address the problem.
arXiv Detail & Related papers (2021-12-14T07:01:20Z)
- Assessing Dialogue Systems with Distribution Distances [48.61159795472962]
We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.
Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
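The summary does not name the specific distances used, but one common distribution-wise distance over embedding sets is the Fréchet distance between fitted Gaussians. The sketch below is an illustrative stand-in using random vectors in place of real conversation embeddings:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets
    (rows = samples): ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2})."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    c1 = np.cov(x, rowvar=False)
    c2 = np.cov(y, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny
        covmean = covmean.real     # imaginary parts; drop them
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))

rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in for embeddings
model = rng.normal(0.5, 1.0, size=(500, 8))   # of real vs. generated turns
d_same = frechet_distance(human, rng.normal(0.0, 1.0, size=(500, 8)))
d_diff = frechet_distance(human, model)
```

A system whose conversation embeddings are distributed like the human ones scores a small distance; a systematic shift (here, the mean offset in `model`) inflates it.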
arXiv Detail & Related papers (2021-05-06T10:30:13Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)