RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich
Semantic Annotations for Task-Oriented Dialogue Modeling
- URL: http://arxiv.org/abs/2010.08738v1
- Date: Sat, 17 Oct 2020 08:18:59 GMT
- Title: RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich
Semantic Annotations for Task-Oriented Dialogue Modeling
- Authors: Jun Quan, Shian Zhang, Qian Cao, Zizhong Li and Deyi Xiong
- Abstract summary: RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic s.
It contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains.
- Score: 35.75880078666584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to alleviate the shortage of multi-domain data and to capture
discourse phenomena for task-oriented dialogue modeling, we propose RiSAWOZ, a
large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic
Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn
semantically annotated dialogues, with more than 150K utterances spanning over
12 domains, which is larger than all previous annotated H2H conversational
datasets. Both single- and multi-domain dialogues are constructed, accounting
for 65% and 35%, respectively. Each dialogue is labeled with comprehensive
dialogue annotations, including dialogue goal in the form of natural language
description, domain, dialogue states and acts at both the user and system side.
In addition to traditional dialogue annotations, we especially provide
linguistic annotations on discourse phenomena, e.g., ellipsis and coreference,
in dialogues, which are useful for dialogue coreference and ellipsis resolution
tasks. Apart from the fully annotated dataset, we also present a detailed
description of the data collection procedure, statistics and analysis of the
dataset. A series of benchmark models and results are reported, including
natural language understanding (intent detection & slot filling), dialogue
state tracking and dialogue context-to-text generation, as well as coreference
and ellipsis resolution, which facilitate the baseline comparison for future
research on this corpus.
Related papers
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
arXiv Detail & Related papers (2023-09-18T11:03:55Z) - DialogStudio: Towards Richest and Most Diverse Unified Dataset
Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z) - MD3: The Multi-Dialect Dataset of Dialogues [20.144004030947507]
We introduce a new dataset of conversational speech representing English from India, Nigeria, and the United States.
The dataset includes more than 20 hours of audio and more than 200,000 orthographically-transcribed tokens.
arXiv Detail & Related papers (2023-05-19T00:14:10Z) - SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation [55.82577086422923]
We provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues.
We release a large-scale supervised dataset called SuperDialseg, containing 9,478 dialogues.
We also provide a benchmark including 18 models across five categories for the dialogue segmentation task.
arXiv Detail & Related papers (2023-05-15T06:08:01Z) - Dialogue Term Extraction using Transfer Learning and Topological Data
Analysis [0.8185867455104834]
We explore different features that can enable systems to discover realizations of domains, slots, and values in dialogues in a purely data-driven fashion.
To examine the utility of each feature set, we train a seed model based on the widely used MultiWOZ data-set.
Our method outperforms the previously proposed approach that relies solely on word embeddings.
arXiv Detail & Related papers (2022-08-22T17:04:04Z) - Manual-Guided Dialogue for Flexible Conversational Agents [84.46598430403886]
How to build and use dialogue data efficiently, and how to deploy models in different domains at scale can be critical issues in building a task-oriented dialogue system.
We propose a novel manual-guided dialogue scheme, where the agent learns the tasks from both dialogue and manuals.
Our proposed scheme reduces the dependence of dialogue models on fine-grained domain ontology, and makes them more flexible to adapt to various domains.
arXiv Detail & Related papers (2022-08-16T08:21:12Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding.
COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z) - OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual
Contexts [35.57757367869986]
We release bf OpenViDial, a large-scale multi- module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z) - Interview: A Large-Scale Open-Source Corpus of Media Dialog [11.28504775964698]
We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts.
Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance.
'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems.
arXiv Detail & Related papers (2020-04-07T02:44:50Z) - CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue
Dataset [58.910961297314415]
CrossWOZ is the first large-scale Chinese Cross-Domain Wizard-of-Oz task-oriented dataset.
It contains 6K dialogue sessions and 102K utterances for 5 domains, including hotel, restaurant, attraction, metro, and taxi.
arXiv Detail & Related papers (2020-02-27T03:06:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.