Interview: A Large-Scale Open-Source Corpus of Media Dialog
- URL: http://arxiv.org/abs/2004.03090v1
- Date: Tue, 7 Apr 2020 02:44:50 GMT
- Title: Interview: A Large-Scale Open-Source Corpus of Media Dialog
- Authors: Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley
- Abstract summary: We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts.
Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance.
'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems.
- Score: 11.28504775964698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing conversational datasets consist either of written proxies for dialog
or small-scale transcriptions of natural speech. We introduce 'Interview': a
large-scale (105K conversations) media dialog dataset collected from news
interview transcripts. Compared to existing large-scale proxies for
conversational data, language models trained on our dataset exhibit better
zero-shot out-of-domain performance on existing spoken dialog datasets,
demonstrating its usefulness in modeling real-world conversations. 'Interview'
contains speaker role annotations for each turn, facilitating the development
of engaging, responsive dialog systems. In fact, experiments on two dialog
tasks show that leveraging such labels improves performance over strong
speaker-agnostic baselines and enables models to generate more specific and
inquisitive responses in interview-style conversations.
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs [15.876075659237722]
Multi-Passage to Dialogue (MP2D) generates question-answering datasets with natural topic transitions.
MP2D maps the flow of topics within a dialogue, effectively mirroring the dynamics of human conversation.
This study introduces a novel benchmark for topic shift dialogues, TS-WikiDialog.
arXiv Detail & Related papers (2024-03-09T06:28:48Z)
- DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation [55.82577086422923]
We provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues.
We release a large-scale supervised dataset called SuperDialseg, containing 9,478 dialogues.
We also provide a benchmark including 18 models across five categories for the dialogue segmentation task.
arXiv Detail & Related papers (2023-05-15T06:08:01Z)
- "How Robust r u?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations [87.95711406978157]
This work presents a new benchmark on spoken task-oriented conversations.
We study multi-domain dialogue state tracking and knowledge-grounded dialogue modeling.
Our data set enables speech-based benchmarking of task-oriented dialogue systems.
arXiv Detail & Related papers (2021-09-28T04:51:04Z)
- What Helps Transformers Recognize Conversational Structure? Importance of Context, Punctuation, and Labels in Dialog Act Recognition [41.1669799542627]
We apply two pre-trained transformer models to structure a conversational transcript as a sequence of dialog acts.
We find that the inclusion of a broader conversational context helps disambiguate many dialog act classes.
A detailed analysis reveals specific segmentation patterns observed in its absence.
arXiv Detail & Related papers (2021-07-05T21:56:00Z)
- Reasoning in Dialog: Improving Response Generation by Context Reading Comprehension [49.92173751203827]
In multi-turn dialog, utterances do not always take the full form of sentences.
We propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question.
arXiv Detail & Related papers (2020-12-14T10:58:01Z)
- RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling [35.75880078666584]
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations.
It contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains.
arXiv Detail & Related papers (2020-10-17T08:18:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.