Towards Quantifiable Dialogue Coherence Evaluation
- URL: http://arxiv.org/abs/2106.00507v1
- Date: Tue, 1 Jun 2021 14:11:17 GMT
- Title: Towards Quantifiable Dialogue Coherence Evaluation
- Authors: Zheng Ye, Liucun Lu, Lishan Huang, Liang Lin, Xiaodan Liang
- Abstract summary: Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework that aims to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE correlates more strongly with human judgements than other state-of-the-art metrics.
- Score: 126.55560816209756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic dialogue coherence evaluation has attracted increasing attention
and is crucial for developing promising dialogue systems. However, existing
metrics have two major limitations: (a) they are mostly trained in a simplified
two-level setting (coherent vs. incoherent), while humans give Likert-type
multi-level coherence scores, dubbed "quantifiable"; (b) their predicted
coherence scores cannot align with the actual human rating standards due to the
absence of human guidance during training. To address these limitations, we
propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel
framework that aims to train a quantifiable dialogue coherence metric that can
reflect the actual human rating standards. Specifically, QuantiDCE includes two
training stages, Multi-Level Ranking (MLR) pre-training and Knowledge
Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is
proposed to enable the model to learn a coarse judgement of coherence
degrees. Then, during KD fine-tuning, the pre-trained model is further
fine-tuned to learn the actual human rating standards from only a small
amount of human-annotated data. To preserve generalizability even with
limited fine-tuning data, a novel KD regularization is introduced to retain
the knowledge learned during the pre-training stage. Experimental results
show that the model trained by QuantiDCE correlates more strongly with human
judgements than other state-of-the-art metrics.
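To make the two-stage objective concrete, below is a minimal sketch (not the authors' code) of how such training losses could look. It assumes a metric model that outputs a scalar coherence score per context-response pair, approximates the MLR loss with a margin-based ranking term over coherence levels, and uses an MSE-style KD regularizer toward the frozen pre-trained model; the function names and the exact loss formulation are illustrative assumptions, and the paper's actual MLR and KD terms may differ.

```python
# Hedged sketch of a QuantiDCE-style two-stage objective (illustrative only).
import torch
import torch.nn.functional as F


def mlr_loss(scores_by_level, margin=0.1):
    """Margin-based multi-level ranking loss (assumed approximation of MLR pre-training).

    scores_by_level: list of 1-D tensors; index k holds predicted scores for
    responses annotated at coherence level k (higher k = more coherent).
    """
    loss = scores_by_level[0].new_zeros(())
    pairs = 0
    for lo in range(len(scores_by_level)):
        for hi in range(lo + 1, len(scores_by_level)):
            # Every higher-level score should exceed every lower-level score
            # by a margin proportional to the level gap.
            diff = scores_by_level[hi].unsqueeze(1) - scores_by_level[lo].unsqueeze(0)
            loss = loss + F.relu(margin * (hi - lo) - diff).mean()
            pairs += 1
    return loss / max(pairs, 1)


def kd_finetune_loss(student_scores, human_scores, teacher_scores, kd_weight=1.0):
    """KD fine-tuning loss (assumed form): fit the few human-annotated ratings
    while staying close to the frozen pre-trained (teacher) model's predictions
    so that the knowledge from MLR pre-training is retained."""
    fit = F.mse_loss(student_scores, human_scores)
    kd_reg = F.mse_loss(student_scores, teacher_scores.detach())
    return fit + kd_weight * kd_reg


if __name__ == "__main__":
    # Toy usage with random scores (hypothetical data, three coherence levels).
    torch.manual_seed(0)
    levels = [torch.rand(4) for _ in range(3)]
    print("MLR pre-training loss:", mlr_loss(levels).item())

    student = torch.rand(8)   # fine-tuned model's predicted scores
    human = torch.rand(8)     # normalized human coherence ratings
    teacher = torch.rand(8)   # frozen pre-trained model's predictions
    print("KD fine-tuning loss:", kd_finetune_loss(student, human, teacher).item())
```

The split mirrors the abstract's description: the ranking term only needs coarse coherence-level labels, while the fine-tuning term anchors the small amount of human-annotated data and regularizes the model toward its pre-trained behaviour.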
Related papers
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z)
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation [58.46761798403072]
We propose a dialogue-level metric that consists of three sub-metrics, each targeting a specific dimension.
The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions.
Compared to the existing state-of-the-art metric, the combined metrics achieve around 16% relative improvement on average.
arXiv Detail & Related papers (2022-10-25T08:26:03Z)
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue [92.01165203498299]
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.