Dialog Inpainting: Turning Documents into Dialogs
- URL: http://arxiv.org/abs/2205.09073v1
- Date: Wed, 18 May 2022 16:58:50 GMT
- Title: Dialog Inpainting: Turning Documents into Dialogs
- Authors: Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi
Mamunur Rashid, Mike Green, Kelvin Guu
- Abstract summary: We produce two datasets totalling 19 million diverse information-seeking dialogs.
Human raters judge the answer adequacy and conversationality of WikiDialog to be as good or better than existing manually-collected datasets.
- Score: 12.131506050808207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many important questions (e.g. "How to eat healthier?") require conversation
to establish context and explore in depth. However, conversational question
answering (ConvQA) systems have long been stymied by scarce training data that
is expensive to collect. To address this problem, we propose a new technique
for synthetically generating diverse and high-quality dialog data: dialog
inpainting. Our approach takes the text of any document and transforms it into
a two-person dialog between the writer and an imagined reader: we treat
sentences from the article as utterances spoken by the writer, and then use a
dialog inpainter to predict what the imagined reader asked or said in between
each of the writer's utterances. By applying this approach to passages from
Wikipedia and the web, we produce WikiDialog and WebDialog, two datasets
totalling 19 million diverse information-seeking dialogs -- 1,000x larger than
the largest existing ConvQA dataset. Furthermore, human raters judge the answer
adequacy and conversationality of WikiDialog to be as good or better than
existing manually-collected datasets. Using our inpainted data to pre-train
ConvQA retrieval systems, we significantly advance state-of-the-art across
three benchmarks (QReCC, OR-QuAC, TREC CAsT) yielding up to 40% relative gains
on standard evaluation metrics.
Related papers
- Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation [13.322409682814827]
We tackle the challenge of inadequate and costly training data for conversational question answering systems.
In this paper, we propose a robust dialog synthesising method.
We learn the segmentation of data for the dialog task instead of using segmenting at sentence boundaries.
arXiv Detail & Related papers (2024-06-06T02:52:45Z) - Dialogizer: Context-aware Conversational-QA Dataset Generation from
Textual Sources [18.09705075305591]
We propose a novel framework called Dialogizer, which has the capability to automatically generate ConvQA datasets with high contextual relevance.
We produce four ConvQA datasets by utilizing documents from multiple domains as the primary source.
arXiv Detail & Related papers (2023-11-09T06:03:11Z) - CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog
Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns totally, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z) - Controllable Dialogue Simulation with In-Context Learning [39.04491297557292]
textscDialogic is a dialogue simulation method based on large language model in-context learning.
Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement.
Our simulated dialogues have near-human fluency and annotation accuracy.
arXiv Detail & Related papers (2022-10-09T06:32:58Z) - End-to-end Spoken Conversational Question Answering: Task, Dataset and
Model [92.18621726802726]
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts.
We propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows.
Our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering.
arXiv Detail & Related papers (2022-04-29T17:56:59Z) - HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on
Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions.
arXiv Detail & Related papers (2022-04-28T00:52:16Z) - What Did You Say? Task-Oriented Dialog Datasets Are Not Conversational!? [4.022057598291766]
We outline a taxonomy of conversational and contextual effects, which we use to examine MultiWOZ, SGD and SMCalFlow.
We find that less than 4% of MultiWOZ's turns and 10% of SGD's turns are conversational, while SMCalFlow is not conversational at all in its current release.
arXiv Detail & Related papers (2022-03-07T14:26:23Z) - DG2: Data Augmentation Through Document Grounded Dialogue Generation [41.81030088619399]
We propose an automatic data augmentation technique grounded on documents through a generative dialogue model.
When supplementing the original dataset, our method achieves significant improvement over traditional data augmentation methods.
arXiv Detail & Related papers (2021-12-15T18:50:14Z) - Converse, Focus and Guess -- Towards Multi-Document Driven Dialogue [53.380996227212165]
We propose a novel task, Multi-Document Driven Dialogue (MD3), in which an agent can guess the target document that the user is interested in by leading a dialogue.
GuessMovie contains 16,881 documents, each describing a movie, and associated 13,434 dialogues.
Our method significantly outperforms several strong baseline methods and is very close to human's performance.
arXiv Detail & Related papers (2021-02-04T06:36:11Z) - Reasoning in Dialog: Improving Response Generation by Context Reading
Comprehension [49.92173751203827]
In multi-turn dialog, utterances do not always take the full form of sentences.
We propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question.
arXiv Detail & Related papers (2020-12-14T10:58:01Z) - Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually with reasoning over dialogue turns with the help of the back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.