Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking
- URL: http://arxiv.org/abs/2203.03123v1
- Date: Mon, 7 Mar 2022 04:07:36 GMT
- Title: Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking
- Authors: Takyoung Kim, Hoonsang Yoon, Yukyung Lee, Pilsung Kang, Misuk Kim
- Abstract summary: Dialogue state tracking (DST) aims to extract essential information from multi-turn dialogue situations.
We propose \textbf{relative slot accuracy} to complement existing metrics.
This study also encourages the reporting of not only joint goal accuracy but also various complementary metrics in DST tasks.
- Score: 15.54992415806844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dialogue state tracking (DST) aims to extract essential information from
multi-turn dialogue situations and take appropriate actions. A belief state,
one of the core pieces of information, refers to the subject and its specific
content, and appears in the form of \texttt{domain-slot-value}. The trained
model predicts "accumulated" belief states in every turn, and joint goal
accuracy and slot accuracy are mainly used to evaluate the prediction; however,
we show that the current evaluation metrics have a critical limitation when
evaluating belief states accumulated as the dialogue proceeds, especially in
the most widely used MultiWOZ dataset. Additionally, we propose \textbf{relative slot
accuracy} to complement existing metrics. Relative slot accuracy does not
depend on the number of predefined slots, and allows intuitive evaluation by
assigning relative scores according to the turn of each dialogue. This study
also encourages the reporting of not only joint goal accuracy but also various
complementary metrics in DST tasks, for the sake of realistic evaluation.
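To make the metric definitions concrete, here is a minimal Python sketch of joint goal accuracy and slot accuracy as standardly computed over accumulated belief states, plus one plausible reading of the proposed relative slot accuracy, in which each turn is scored only over slots that actually appear, so the score no longer depends on the number of predefined slots. The helper signatures and the exact relative-score formula are illustrative assumptions, not the paper's reference implementation.

```python
from typing import Dict, List

# A belief state maps "domain-slot" keys to values,
# e.g. {"hotel-area": "north", "hotel-stars": "4"}
BeliefState = Dict[str, str]

def joint_goal_accuracy(preds: List[BeliefState], golds: List[BeliefState]) -> float:
    """Fraction of turns whose accumulated belief state matches the gold exactly.
    A single wrong slot in a turn makes that whole turn count as a miss."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def slot_accuracy(preds: List[BeliefState], golds: List[BeliefState],
                  predefined_slots: List[str]) -> float:
    """Per-slot correctness averaged over *all* predefined slots at every turn.
    Slots absent from both states count as correct 'none' matches, which is
    the inflation the paper criticizes."""
    correct = total = 0
    for p, g in zip(preds, golds):
        for slot in predefined_slots:
            total += 1
            correct += p.get(slot, "none") == g.get(slot, "none")
    return correct / total

def relative_slot_accuracy(preds: List[BeliefState], golds: List[BeliefState]) -> float:
    """Illustrative reading of the proposed metric (the paper's exact formula
    may differ): score each turn only over slots appearing in the predicted
    or gold state, so the number of predefined slots no longer matters."""
    scores = []
    for p, g in zip(preds, golds):
        active = set(p) | set(g)
        if active:
            scores.append(sum(p.get(s) == g.get(s) for s in active) / len(active))
    return sum(scores) / len(scores) if scores else 0.0
```

Under this reading, a model that predicts nothing still scores near-perfect slot accuracy on a schema with many slots, while its relative slot accuracy drops to zero on any turn with active slots.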
Related papers
- Chain of Thought Explanation for Dialogue State Tracking [52.015771676340016]
Dialogue state tracking (DST) aims to record user queries and goals during a conversational interaction.
We propose a model named Chain-of-Thought-Explanation (CoTE) for the DST task.
CoTE is designed to create detailed explanations step by step after determining the slot values.
arXiv Detail & Related papers (2024-03-07T16:59:55Z)
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
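A minimal sketch of the coverage idea, assuming a generic NLI scorer `entail_prob(premise, hypothesis)`; the callable and threshold are illustrative, not SWING's actual training signal, which is more fine-grained.

```python
from typing import Callable, List

def uncovered_sentences(summary: str,
                        reference_sents: List[str],
                        entail_prob: Callable[[str, str], float],
                        thresh: float = 0.5) -> List[str]:
    """Flag reference sentences whose content the generated summary does not
    entail; these are the candidates for a 'not yet covered' signal."""
    return [s for s in reference_sents if entail_prob(summary, s) < thresh]
```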
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z)
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness [35.73012379398233]
We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
arXiv Detail & Related papers (2022-11-15T19:33:50Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
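One common form of example-level meta-evaluation, and a plausible sketch of this protocol, is to treat each metric as a binary consistency detector and score it with ROC AUC against binarized human labels; TRUE's exact setup may differ in detail.

```python
from typing import List

from sklearn.metrics import roc_auc_score

def example_level_meta_eval(metric_scores: List[float],
                            human_labels: List[int]) -> float:
    """Measure how well a metric's scores rank factually consistent
    examples (label 1) above inconsistent ones (label 0)."""
    return roc_auc_score(human_labels, metric_scores)
```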
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking [3.34618986084988]
We propose Coreference Dialogue State Tracker (CDST) that explicitly models the coreference feature.
Experimental results on MultiWOZ 2.1 dataset show that the proposed model achieves the state-of-the-art joint goal accuracy of 56.47%.
arXiv Detail & Related papers (2021-06-16T11:47:29Z)
- CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers [92.5628632009802]
We propose controllable counterfactuals (CoCo) to bridge the gap and evaluate dialogue state tracking (DST) models on novel scenarios.
CoCo generates novel conversation scenarios in two steps: (i) counterfactual goal generation at turn-level by dropping and adding slots followed by replacing slot values, and (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow.
Human evaluations show that CoCo-generated conversations reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations.
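A toy sketch of step (i) only; the drop/add probabilities and the `value_pool` of candidate slot values are illustrative assumptions, and step (ii) would additionally require a generation model conditioned on the new goal.

```python
import random
from typing import Dict, List

def counterfactual_goal(state: Dict[str, str],
                        value_pool: Dict[str, List[str]],
                        drop_p: float = 0.3, add_p: float = 0.3) -> Dict[str, str]:
    """CoCo-style turn-level counterfactual goal: drop some slots, add new
    ones, then replace slot values with alternatives from a candidate pool."""
    # (a) drop slots at random
    new_state = {s: v for s, v in state.items() if random.random() > drop_p}
    # (b) add slots not currently in the state
    for slot, values in value_pool.items():
        if slot not in new_state and random.random() < add_p:
            new_state[slot] = random.choice(values)
    # (c) replace values of surviving slots where candidates exist
    for slot in list(new_state):
        if slot in value_pool:
            new_state[slot] = random.choice(value_pool[slot])
    return new_state
```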
arXiv Detail & Related papers (2020-10-24T09:39:35Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually, reasoning over dialogue turns with the help of back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
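As a deliberately tiny stand-in for the idea: the actual model learns a scoring function on top of pre-trained latent representations, and plain cosine similarity here is my simplification of that learned scorer.

```python
import torch
import torch.nn.functional as F

def unreferenced_score(context_emb: torch.Tensor,
                       response_emb: torch.Tensor) -> float:
    """Score a candidate response against the dialogue context using only
    latent representations; no gold response is needed at inference."""
    return F.cosine_similarity(context_emb, response_emb, dim=-1).mean().item()
```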
arXiv Detail & Related papers (2020-05-01T20:01:39Z)