Towards Quantifiable Dialogue Coherence Evaluation
- URL: http://arxiv.org/abs/2106.00507v1
- Date: Tue, 1 Jun 2021 14:11:17 GMT
- Title: Towards Quantifiable Dialogue Coherence Evaluation
- Authors: Zheng Ye, Liucun Lu, Lishan Huang, Liang Lin, Xiaodan Liang
- Abstract summary: Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE exhibits stronger correlations with human judgements than other state-of-the-art metrics.
- Score: 126.55560816209756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic dialogue coherence evaluation has attracted increasing attention
and is crucial for developing promising dialogue systems. However, existing
metrics have two major limitations: (a) they are mostly trained in a simplified
two-level setting (coherent vs. incoherent), while humans give Likert-type
multi-level coherence scores, dubbed "quantifiable"; (b) their predicted
coherence scores cannot align with the actual human rating standards due to the
absence of human guidance during training. To address these limitations, we
propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel
framework aiming to train a quantifiable dialogue coherence metric that can
reflect the actual human rating standards. Specifically, QuantiDCE includes two
training stages, Multi-Level Ranking (MLR) pre-training and Knowledge
Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is
proposed to enable the model to learn a coarse judgement of coherence
degrees. Then, during KD fine-tuning, the pre-trained model is further
fine-tuned to learn the actual human rating standards with only a small amount
of human-annotated data. To preserve generalizability even with limited
fine-tuning data, a novel KD regularization is introduced to retain the
knowledge learned at the pre-training stage. Experimental results show that the
model trained by QuantiDCE exhibits stronger correlations with human judgements
than other state-of-the-art metrics.
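A minimal sketch of the two training objectives described above, assuming a PyTorch setup in which an encoder maps a (context, response) pair to a scalar coherence score. The margin, loss weight, and pairwise formulation are illustrative assumptions, not the paper's reported design.

```python
# Hedged sketch of QuantiDCE-style objectives; hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def mlr_loss(scores_by_level, margin=0.2):
    """Multi-Level Ranking (MLR) pre-training: scores of responses at a more
    coherent level should exceed those at a less coherent level by `margin`.

    scores_by_level: list of 1-D tensors, ordered from least to most coherent.
    """
    loss = torch.tensor(0.0)
    for lower, higher in zip(scores_by_level[:-1], scores_by_level[1:]):
        # every (lower-level, higher-level) score pair must be separated by the margin
        gap = lower.unsqueeze(1) - higher.unsqueeze(0) + margin
        loss = loss + F.relu(gap).mean()
    return loss

def kd_finetune_loss(student_scores, human_scores, teacher_scores, alpha=1.0):
    """KD fine-tuning: regress onto the few human ratings while a knowledge-
    distillation regularizer keeps the student close to the frozen
    MLR-pretrained teacher, retaining the pre-training knowledge."""
    rating_loss = F.mse_loss(student_scores, human_scores)
    kd_reg = F.mse_loss(student_scores, teacher_scores.detach())
    return rating_loss + alpha * kd_reg
```

In this reading, training would first minimize `mlr_loss` on data ranked into multiple coherence levels, then switch to `kd_finetune_loss` on the small human-annotated set, with the KD term supplying the regularization mentioned in the abstract.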
Related papers
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator [6.532478490187084]
MESA employs a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment.
Using GPT-4o as its backbone, MESA correlates with human judgment in error detection and achieves mid-range Spearman and Kendall correlations in reflecting error impact on summary quality, on average 0.25 higher than previous methods.
arXiv Detail & Related papers (2024-11-27T15:35:32Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) that fits human preference according to quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z)
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue [92.01165203498299]
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
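For contrast with reference-based metrics, a minimal sketch of the unreferenced-scoring idea from the last entry above, assuming a Hugging Face BERT encoder with mean pooling and a bilinear scoring head; the model name, pooling, and head are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical unreferenced dialogue scorer: rate a response against its
# context directly, with no gold reference response needed at inference.
# Model choice, pooling, and scoring head are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = torch.nn.Bilinear(768, 768, 1)  # would be trained, e.g. on next-utterance prediction

def embed(text: str) -> torch.Tensor:
    # mean-pooled last hidden state as the utterance representation
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden.mean(dim=1)

def coherence_score(context: str, response: str) -> float:
    return torch.sigmoid(scorer(embed(context), embed(response))).item()
```

Scoring in this setup needs only the dialogue context and the candidate response, matching the online evaluation setting described in that entry.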
This list is automatically generated from the titles and abstracts of the papers on this site.