Related papers: NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation

URL: http://arxiv.org/abs/2103.02548v2
Date: Fri, 5 Mar 2021 17:12:20 GMT
Title: NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation
Authors: Xiaoyang Wang, Chen Li, Jianqiao Zhao, Dong Yu
Abstract summary: In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv. Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1.
Score: 25.172938128539418
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv, which allows the participants to chat about anything they want as long as any element from the topic is mentioned and the topic shift is smooth. Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1. These conversations contain in-depth discussions on related topics or widely natural transition between multiple topics. We believe either way is normal for human conversation. To facilitate the research on this corpus, we provide results of several benchmark models. Comparative results show that for this dataset, our current models are not able to provide significant improvement by introducing background knowledge/topic. Therefore, the proposed dataset should be a good benchmark for further research to evaluate the validity and naturalness of multi-turn conversation systems. Our dataset is available at https://ai.tencent.com/ailab/nlp/dialogue/#datasets.

Related papers

CASPER: A Large Scale Spontaneous Speech Dataset [25.446606381490025]
This paper introduces our dataset and methodology, laying the groundwork for addressing the shortage of spontaneous speech data. We plan to expand this dataset in future stages, offering a growing resource for the research community.
arXiv Detail & Related papers (2025-05-30T22:03:59Z)
NewsDialogues: Towards Proactive News Grounded Conversation [72.10055780635625]
We propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset, NewsDialogues, which includes 1K conversations with a total of 14.6K utterances.
arXiv Detail & Related papers (2023-08-12T08:33:42Z)
SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD. Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting. We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z)
Findings on Conversation Disentanglement [28.874162427052905]
We build a learning model that learns utterance-to-utterance and utterance-to-thread classification. Experiments on the Ubuntu IRC dataset show that this approach has the potential to outperform the conventional greedy approach.
arXiv Detail & Related papers (2021-12-10T05:54:48Z)
TopiOCQA: Open-domain Conversational Question Answering with Topic Switching [11.717296856448566]
We introduce TopiOCQA, an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. We evaluate several baselines, by combining state-of-the-art document retrieval methods with neural reader models.
arXiv Detail & Related papers (2021-10-02T09:53:48Z)
ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining [61.82562838486632]
We crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data.
arXiv Detail & Related papers (2021-06-01T22:17:13Z)
MultiTalk: A Highly-Branching Dialog Testbed for Diverse Conversations [39.81965687032923]
We present the MultiTalk dataset, a corpus of over 320,000 sentences of written conversational dialog. We make multiple contributions to study dialog generation in the highly branching setting. Our culminating task is a challenging theory of mind problem, a controllable generation task.
arXiv Detail & Related papers (2021-02-02T02:29:40Z)
Response Selection for Multi-Party Conversations with Dynamic Topic Tracking [63.15158355071206]
We frame response selection as a dynamic topic tracking task to match the topic between the response and relevant conversation context. We propose a novel multi-task learning framework that supports efficient encoding through large pretrained models. Experimental results on the DSTC-8 Ubuntu IRC dataset show state-of-the-art results in response selection and topic disentanglement tasks.
arXiv Detail & Related papers (2020-10-15T14:21:38Z)
KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation [66.99734491847076]
We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
arXiv Detail & Related papers (2020-04-08T16:25:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.