NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven
Conversation
- URL: http://arxiv.org/abs/2103.02548v2
- Date: Fri, 5 Mar 2021 17:12:20 GMT
- Title: NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven
Conversation
- Authors: Xiaoyang Wang, Chen Li, Jianqiao Zhao, Dong Yu
- Abstract summary: In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv.
Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1.
- Score: 25.172938128539418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a Chinese multi-turn topic-driven conversation
dataset, NaturalConv, which allows the participants to chat about anything they want
as long as any element from the topic is mentioned and the topic shift is
smooth. Our corpus contains 19.9K conversations from six domains, and 400K
utterances with an average turn number of 20.1. These conversations contain
in-depth discussions on related topics or wide, natural transitions between
multiple topics. We believe either way is normal for human conversation. To
facilitate the research on this corpus, we provide results of several benchmark
models. Comparative results show that for this dataset, our current models are
not able to provide significant improvement by introducing background
knowledge/topic. Therefore, the proposed dataset should be a good benchmark for
further research to evaluate the validity and naturalness of multi-turn
conversation systems. Our dataset is available at
https://ai.tencent.com/ailab/nlp/dialogue/#datasets.
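As a quick sanity check on the corpus statistics quoted above, the short sketch below recomputes the average from the rounded totals. It assumes that a "turn" is counted as one utterance and that 19.9K and 400K stand for 19,900 and 400,000; the abstract does not give the exact counts, so the variable names and values here are illustrative only.

```python
# Rounded figures quoted in the abstract (exact counts are not given there).
conversations = 19_900  # "19.9K conversations" (assumed to mean 19,900)
utterances = 400_000    # "400K utterances" (assumed to mean 400,000)

# Average utterances per conversation implied by the rounded totals,
# assuming one turn corresponds to one utterance.
implied_average = utterances / conversations
print(f"Implied average turns per conversation: {implied_average:.1f}")
```

Under these assumptions the implied value agrees with the reported average turn number of 20.1.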
Related papers
- NewsDialogues: Towards Proactive News Grounded Conversation [72.10055780635625]
We propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news.
To further develop this novel task, we collect a human-to-human Chinese dialogue dataset, NewsDialogues, which includes 1K conversations with a total of 14.6K utterances.
arXiv Detail & Related papers (2023-08-12T08:33:42Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z)
- Findings on Conversation Disentanglement [28.874162427052905]
We build a learning model that learns utterance-to-utterance and utterance-to-thread classification.
Experiments on the Ubuntu IRC dataset show that this approach has the potential to outperform the conventional greedy approach.
arXiv Detail & Related papers (2021-12-10T05:54:48Z)
- TopiOCQA: Open-domain Conversational Question Answering with Topic Switching [11.717296856448566]
We introduce TopiOCQA, an open-domain conversational dataset with topic switches on Wikipedia.
TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers.
We evaluate several baselines, by combining state-of-the-art document retrieval methods with neural reader models.
arXiv Detail & Related papers (2021-10-02T09:53:48Z)
- ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining [61.82562838486632]
We crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads.
We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data.
arXiv Detail & Related papers (2021-06-01T22:17:13Z)
- MultiTalk: A Highly-Branching Dialog Testbed for Diverse Conversations [39.81965687032923]
We present the MultiTalk dataset, a corpus of over 320,000 sentences of written conversational dialog.
We make multiple contributions to study dialog generation in the highly branching setting.
Our culminating task is a challenging theory of mind problem, a controllable generation task.
arXiv Detail & Related papers (2021-02-02T02:29:40Z)
- Response Selection for Multi-Party Conversations with Dynamic Topic Tracking [63.15158355071206]
We frame response selection as a dynamic topic tracking task to match the topic between the response and relevant conversation context.
We propose a novel multi-task learning framework that supports efficient encoding through large pretrained models.
Experimental results on the DSTC-8 Ubuntu IRC dataset show state-of-the-art results in response selection and topic disentanglement tasks.
arXiv Detail & Related papers (2020-10-15T14:21:38Z)
- KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation [66.99734491847076]
We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs.
Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
arXiv Detail & Related papers (2020-04-08T16:25:39Z)