Related papers: KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation

URL: http://arxiv.org/abs/2004.04100v1
Date: Wed, 8 Apr 2020 16:25:39 GMT
Title: KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Authors: Hao Zhou, Chujie Zheng, Kaili Huang, Minlie Huang, Xiaoyan Zhu
Abstract summary: We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
Score: 66.99734491847076
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The research of knowledge-driven conversational systems is largely limited due to the lack of dialog data which consist of multi-turn conversations on multiple topics and with knowledge annotations. In this paper, we propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics. To facilitate the following research on this corpus, we provide several benchmark models. Comparative results show that the models can be enhanced by introducing background knowledge, yet there is still a large space for leveraging knowledge to model multi-turn conversations for further research. Results also show that there are obvious performance differences between different domains, indicating that it is worthwhile to further explore transfer learning and domain adaptation. The corpus and benchmark models are publicly available.

Related papers

WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
Multi-Granularity Prompts for Topic Shift Detection in Dialogue [13.739991183173494]
The goal of dialogue topic shift detection is to identify whether the current topic in a conversation has changed or needs to change. Previous work focused on detecting topic shifts using pre-trained models to encode the utterance. We take a prompt-based approach to fully extract topic information from dialogues at multiple-granularity, i.e., label, turn, and topic.
arXiv Detail & Related papers (2023-05-23T12:35:49Z)
TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z)
CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation. It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources. To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z)
Exploring Effective Information Utilization in Multi-Turn Topic-Driven Conversations [11.550422073645425]
We encode topic and dialogue history information using certain prompts with multiple channels of Fusion-in-Decoder (FiD). Our experiments focus on a specific Chinese dataset named NaturalConv, where the conversation revolves around a piece of recent news.
arXiv Detail & Related papers (2022-09-01T06:20:39Z)
Advancing an Interdisciplinary Science of Conversation: Insights from a Large Multimodal Corpus of Human Speech [0.12038936091716987]
In this report we advance an interdisciplinary science of conversation, with findings from a large, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850 hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression. We provide a comprehensive mixed-method report, based on quantitative analysis and qualitative review of each recording, that showcases how individuals from diverse backgrounds alter their communication patterns and find ways to connect.
arXiv Detail & Related papers (2022-03-01T18:50:33Z)
QAConv: Question Answering on Informative Conversations [85.2923607672282]
We focus on informative conversations including business emails, panel discussions, and work channels. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions.
arXiv Detail & Related papers (2021-05-14T15:53:05Z)
NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation [28.085557013067678]
In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv. Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1.
arXiv Detail & Related papers (2021-03-03T17:38:33Z)
MultiTalk: A Highly-Branching Dialog Testbed for Diverse Conversations [39.81965687032923]
We present the MultiTalk dataset, a corpus of over 320,000 sentences of written conversational dialog. We make multiple contributions to study dialog generation in the highly branching setting. Our culminating task is a challenging theory of mind problem, a controllable generation task.
arXiv Detail & Related papers (2021-02-02T02:29:40Z)
Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP. This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations. Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.