Annotation Inconsistency and Entity Bias in MultiWOZ
- URL: http://arxiv.org/abs/2105.14150v1
- Date: Sat, 29 May 2021 00:09:06 GMT
- Title: Annotation Inconsistency and Entity Bias in MultiWOZ
- Authors: Kun Qian, Ahmad Beirami, Zhouhan Lin, Ankita De, Alborz Geramifard,
Zhou Yu, Chinnadhurai Sankar
- Abstract summary: MultiWOZ is one of the most popular multi-domain task-oriented dialog datasets.
It has been widely accepted as a benchmark for various dialog tasks, e.g., dialog state tracking (DST), natural language generation (NLG), and end-to-end (E2E) dialog modeling.
- Score: 40.127114829948965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: MultiWOZ is one of the most popular multi-domain task-oriented dialog
datasets, containing 10K+ annotated dialogs covering eight domains. It has been
widely accepted as a benchmark for various dialog tasks, e.g., dialog state
tracking (DST), natural language generation (NLG), and end-to-end (E2E) dialog
modeling. In this work, we identify an overlooked issue with dialog state
annotation inconsistencies in the dataset, where a slot type is tagged
inconsistently across similar dialogs leading to confusion for DST modeling. We
propose an automated correction for this issue, which is present in a whopping
70% of the dialogs. Additionally, we notice that there is significant entity
bias in the dataset (e.g., "cambridge" appears in 50% of the destination cities
in the train domain). The entity bias can potentially lead to named entity
memorization in generative models, which may go unnoticed as the test set
suffers from a similar entity bias as well. We release a new test set with all
entities replaced with unseen entities. Finally, we benchmark joint goal
accuracy (JGA) of the state-of-the-art DST baselines on these modified versions
of the data. Our experiments show that the annotation inconsistency corrections
lead to 7-10% improvement in JGA. On the other hand, we observe a 29% drop in
JGA when models are evaluated on the new test set with unseen entities.
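The joint goal accuracy (JGA) metric benchmarked above counts a dialog turn as correct only when the predicted dialog state matches the gold state exactly, which is why both annotation inconsistencies and entity memorization move the number so sharply. A minimal sketch of this computation, assuming per-turn states are dicts mapping "domain-slot" names to value strings (the slot names and normalization here are illustrative, not the paper's exact evaluation script):

```python
# Sketch of joint goal accuracy (JGA): a turn counts as correct only if
# every (domain-slot, value) pair in the predicted dialog state exactly
# matches the gold state for that turn.

def joint_goal_accuracy(predictions, golds):
    """predictions/golds: parallel lists of per-turn dialog states,
    each a dict mapping 'domain-slot' -> value string."""
    assert len(predictions) == len(golds)

    def norm(state):
        # Light normalization so trivial casing/whitespace differences
        # do not count as errors (an assumption, not the official script).
        return {k: v.strip().lower() for k, v in state.items()}

    correct = sum(1 for pred, gold in zip(predictions, golds)
                  if norm(pred) == norm(gold))
    return correct / len(golds) if golds else 0.0

# Toy example: one wrong destination city out of two turns -> JGA = 0.5
preds = [{"train-destination": "cambridge"},
         {"train-destination": "london", "train-day": "monday"}]
golds = [{"train-destination": "cambridge"},
         {"train-destination": "norwich", "train-day": "monday"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```

Because the match is all-or-nothing per turn, a model that memorized "cambridge" as the dominant destination entity would fail entire turns on the unseen-entity test set, consistent with the 29% JGA drop reported above.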
Related papers
- SPACE-2: Tree-Structured Semi-Supervised Contrastive Pre-training for
Task-Oriented Dialog Understanding [68.94808536012371]
We propose a tree-structured pre-trained conversation model, which learns dialog representations from limited labeled dialogs and large-scale unlabeled dialog corpora.
Our method can achieve new state-of-the-art results on the DialoGLUE benchmark consisting of seven datasets and four popular dialog understanding tasks.
arXiv Detail & Related papers (2022-09-14T13:42:50Z)
- CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance [18.936466253481363]
We design a collection of metrics called CheckDST to test well-known weaknesses with augmented test sets.
We find that span-based classification models are resilient to unseen named entities but not robust to language variety.
Due to their respective weaknesses, neither approach is yet suitable for real-world deployment.
arXiv Detail & Related papers (2021-12-15T18:10:54Z)
- Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues [7.8378818005171125]
Given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation.
We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values.
We show that the succinct representation reduces the compounding effect of translation errors.
arXiv Detail & Related papers (2021-11-04T01:08:14Z)
- Zero-Shot Dialogue Disentanglement by Self-Supervised Entangled Response Selection [79.37200787463917]
Dialogue disentanglement aims to group utterances in a long and multi-participant dialogue into threads.
This is useful for discourse analysis and downstream applications such as dialogue response selection.
We are the first to propose a zero-shot dialogue disentanglement solution.
arXiv Detail & Related papers (2021-10-25T05:15:01Z)
- Zero-shot Generalization in Dialog State Tracking through Generative Question Answering [10.81203437307028]
We introduce a novel framework that supports natural language queries for unseen constraints and slots in task-oriented dialogs.
Our approach is based on generative question-answering using a conditional domain model pre-trained on substantive English sentences.
arXiv Detail & Related papers (2021-01-20T21:47:20Z)
- Improving Limited Labeled Dialogue State Tracking with Self-Supervision [91.68515201803986]
Existing dialogue state tracking (DST) models require plenty of labeled data.
We present and investigate two self-supervised objectives: preserving latent consistency and modeling conversational behavior.
Our proposed self-supervised signals can improve joint goal accuracy by 8.95% when only 1% labeled data is used.
arXiv Detail & Related papers (2020-10-26T21:57:42Z)
- MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines [15.540213987132839]
This work introduces MultiWOZ 2.2, yet another improved version of this dataset.
Firstly, we identify and fix dialogue state annotation errors across 17.3% of the utterances on top of MultiWOZ 2.1.
Secondly, we redefine the vocabularies of slots with a large number of possible values.
arXiv Detail & Related papers (2020-07-10T22:52:14Z)
- Paraphrase Augmented Task-Oriented Dialog Generation [68.1790912977053]
We propose a paraphrase augmented response generation (PARG) framework that jointly trains a paraphrase model and a response generation model.
We also design a method to automatically construct paraphrase training data set based on dialog state and dialog act labels.
arXiv Detail & Related papers (2020-04-16T05:12:36Z)
- Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
arXiv Detail & Related papers (2020-04-06T14:47:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.