CoCo: Controllable Counterfactuals for Evaluating Dialogue State
Trackers
- URL: http://arxiv.org/abs/2010.12850v3
- Date: Fri, 26 Mar 2021 06:35:21 GMT
- Authors: Shiyang Li, Semih Yavuz, Kazuma Hashimoto, Jia Li, Tong Niu, Nazneen
Rajani, Xifeng Yan, Yingbo Zhou and Caiming Xiong
- Abstract summary: We propose controllable counterfactuals (CoCo) to bridge the gap and evaluate dialogue state tracking (DST) models on novel scenarios.
CoCo generates novel conversation scenarios in two steps: (i) counterfactual goal generation at turn-level by dropping and adding slots followed by replacing slot values, and (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow.
Human evaluations show that CoCo-generated conversations perfectly reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations.
- Score: 92.5628632009802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dialogue state trackers have made significant progress on benchmark datasets,
but their generalization capability to novel and realistic scenarios beyond the
held-out conversations is less understood. We propose controllable
counterfactuals (CoCo) to bridge this gap and evaluate dialogue state tracking
(DST) models on novel scenarios, i.e., would the system successfully tackle the
request if the user responded differently but still consistently with the
dialogue flow? CoCo leverages turn-level belief states as counterfactual
conditionals to produce novel conversation scenarios in two steps: (i)
counterfactual goal generation at turn-level by dropping and adding slots
followed by replacing slot values, (ii) counterfactual conversation generation
that is conditioned on (i) and consistent with the dialogue flow. Evaluating
state-of-the-art DST models on MultiWOZ dataset with CoCo-generated
counterfactuals results in a significant performance drop of up to 30.8% (from
49.4% to 18.6%) in absolute joint goal accuracy. In comparison, widely used
techniques like paraphrasing only affect the accuracy by at most 2%. Human
evaluations show that CoCo-generated conversations perfectly reflect the
underlying user goal with more than 95% accuracy and are as human-like as the
original conversations, further strengthening its reliability and promise to be
adopted as part of the robustness evaluation of DST models.
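The two-step recipe in the abstract (drop and add slots, then replace slot values) can be sketched as follows. This is a minimal illustration of step (i), counterfactual goal generation at the turn level, not the authors' implementation; the slot-value ontology below is a hypothetical stand-in for the MultiWOZ ontology.

```python
import random

# Hypothetical slot-value vocabulary standing in for the MultiWOZ ontology.
ONTOLOGY = {
    "restaurant-food": ["italian", "chinese", "indian"],
    "restaurant-area": ["north", "south", "centre"],
    "restaurant-pricerange": ["cheap", "moderate", "expensive"],
}

def counterfactual_goal(turn_belief, drop_prob=0.3, add_prob=0.3, rng=None):
    """Step (i) of CoCo, simplified: drop slots, add slots, then
    replace each remaining slot's value with an alternative."""
    rng = rng or random.Random(0)
    goal = dict(turn_belief)

    # Drop: randomly remove some slots from the turn-level belief state.
    for slot in list(goal):
        if rng.random() < drop_prob:
            del goal[slot]

    # Add: randomly introduce slots absent from the current goal.
    for slot, values in ONTOLOGY.items():
        if slot not in goal and rng.random() < add_prob:
            goal[slot] = rng.choice(values)

    # Replace: swap every value for a different one from the ontology.
    for slot, value in goal.items():
        alternatives = [v for v in ONTOLOGY[slot] if v != value]
        if alternatives:
            goal[slot] = rng.choice(alternatives)
    return goal

original = {"restaurant-food": "italian", "restaurant-area": "north"}
print(counterfactual_goal(original))
```

In the paper, step (ii) then conditions a generation model on this counterfactual goal to produce a user utterance consistent with the dialogue flow; that step is omitted here.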
Related papers
- Chain of Thought Explanation for Dialogue State Tracking [52.015771676340016]
Dialogue state tracking (DST) aims to record user queries and goals during a conversational interaction.
We propose a model named Chain-of-Thought-Explanation (CoTE) for the DST task.
CoTE is designed to create detailed explanations step by step after determining the slot values.
arXiv Detail & Related papers (2024-03-07T16:59:55Z)
- Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking [15.54992415806844]
Dialogue state tracking (DST) aims to extract essential information from multi-turn dialogue situations.
We propose relative slot accuracy to complement existing metrics.
This study also encourages reporting not only joint goal accuracy but also various complementary metrics in DST tasks.
arXiv Detail & Related papers (2022-03-07T04:07:36Z)
- Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations [2.6529642559155944]
We propose the Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations network.
This model extracts information of each dialogue turn by modeling interactions among each turn utterance, the corresponding last dialogue states, and dialogue slots.
arXiv Detail & Related papers (2021-07-12T02:30:30Z)
- DynaEval: Unifying Turn and Dialogue Level Evaluation [60.66883575106898]
We propose DynaEval, a unified automatic evaluation framework.
It is capable of not only performing turn-level evaluation but also holistically considering the quality of the entire dialogue.
Experiments show that DynaEval significantly outperforms the state-of-the-art dialogue coherence model.
arXiv Detail & Related papers (2021-06-02T12:23:18Z)
- Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
arXiv Detail & Related papers (2021-06-01T14:11:17Z)
- I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling [104.09033240889106]
We introduce the DialoguE COntradiction DEtection task (DECODE) and a new conversational dataset containing both human-human and human-bot contradictory dialogues.
We then compare a structured utterance-based approach of using pre-trained Transformer models for contradiction detection with the typical unstructured approach.
arXiv Detail & Related papers (2020-12-24T18:47:49Z)
- Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations [6.129731338249762]
Current automated methods to estimate turn and dialogue level user satisfaction employ hand-crafted features.
We propose a novel user satisfaction estimation approach which minimizes an adaptive multi-task loss function.
The BiLSTM based deep neural net model automatically weighs each turn's contribution towards the estimated dialogue-level rating.
arXiv Detail & Related papers (2020-10-06T05:53:13Z)
- CREDIT: Coarse-to-Fine Sequence Generation for Dialogue State Tracking [44.38388988238695]
A dialogue state tracker aims to accurately find a compact representation of the current dialogue status.
We employ a structured state representation and cast dialogue state tracking as a sequence generation problem.
Experiments demonstrate our tracker achieves encouraging joint goal accuracy for the five domains in MultiWOZ 2.0 and MultiWOZ 2.1 datasets.
arXiv Detail & Related papers (2020-09-22T10:27:18Z)
- Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually with reasoning over dialogue turns with the help of the back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
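Joint goal accuracy, the metric in which the abstract reports a 30.8% absolute drop and which the relative-slot-accuracy paper above seeks to complement, counts a turn as correct only when the entire predicted belief state exactly matches the gold one. A minimal sketch (the slot names are illustrative, not from any specific dataset):

```python
def joint_goal_accuracy(predicted, gold):
    """Fraction of turns whose predicted belief state exactly matches
    the gold belief state across all slots (simplified sketch)."""
    assert len(predicted) == len(gold)
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)

gold = [
    {"restaurant-food": "italian"},
    {"restaurant-food": "italian", "restaurant-area": "north"},
]
pred = [
    {"restaurant-food": "italian"},
    {"restaurant-food": "chinese", "restaurant-area": "north"},
]
print(joint_goal_accuracy(pred, gold))  # 0.5
```

The all-or-nothing matching is what makes the metric so sensitive to counterfactual perturbations: a single wrong slot value in a turn zeroes out that turn's contribution.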
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.