Re$^3$Dial: Retrieve, Reorganize and Rescale Dialogue Corpus for
Long-Turn Open-Domain Dialogue Pre-training
- URL: http://arxiv.org/abs/2305.02606v2
- Date: Sun, 22 Oct 2023 07:41:06 GMT
- Title: Re$^3$Dial: Retrieve, Reorganize and Rescale Dialogue Corpus for
Long-Turn Open-Domain Dialogue Pre-training
- Authors: Jiaxin Wen, Hao Zhou, Jian Guan, Minlie Huang
- Abstract summary: Most dialogues in existing pre-training corpora contain fewer than three turns.
We propose the Retrieve, Reorganize and Rescale framework (Re$^3$Dial) to automatically construct billion-scale long-turn dialogues.
By repeating the above process, Re$^3$Dial can yield a coherent long-turn dialogue.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training on large-scale open-domain dialogue data can substantially
improve the performance of dialogue models. However, the pre-trained dialogue
model's ability to utilize long-range context is limited due to the scarcity of
long-turn dialogue sessions. Most dialogues in existing pre-training corpora
contain fewer than three turns of dialogue. To alleviate this issue, we propose
the Retrieve, Reorganize and Rescale framework (Re$^3$Dial), which can
automatically construct billion-scale long-turn dialogues by reorganizing
existing short-turn ones. Given a short-turn session, Re$^3$Dial first employs
a session retriever to retrieve coherent consecutive sessions. To this end, we
train the retriever to capture semantic and discourse relations within
multi-turn dialogues through contrastive training. Next, Re$^3$Dial samples a
session from retrieved results following a diversity sampling strategy, which
is designed to penalize repetitive or generic sessions. A longer session is
then derived by concatenating the original session and the sampled session. By
repeating the above process, Re$^3$Dial can yield a coherent long-turn
dialogue. Extensive experiments on multiple multi-turn dialogue benchmarks
demonstrate that Re$^3$Dial significantly improves the dialogue model's ability
to utilize long-range context and thus generate more sensible and informative
responses. Finally, we build a toolkit for efficiently rescaling conversations
with Re$^3$Dial, which enables us to construct a corpus containing 1B Chinese
dialogue sessions with 11.3 turns on average (5$\times$ longer than the
original corpus). Our retriever model, code, and data are publicly available at
\url{https://github.com/thu-coai/Re3Dial}.
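The abstract describes a three-step loop: retrieve coherent candidate sessions for the current dialogue, sample one under a diversity strategy that penalizes repetitive sessions, and concatenate it onto the dialogue, repeating until the desired length. A minimal sketch of that loop is below; the retriever here is a crude token-overlap stand-in (the paper trains a dense contrastive retriever), and all function names and the penalty weight are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of the Re^3Dial rescaling loop: retrieve candidate
# next sessions, sample one with a diversity penalty, concatenate, repeat.
# A session is a list of utterance strings.
import random

def retrieve(session, corpus, top_k=3):
    """Stand-in for the trained session retriever: score each candidate
    session by token overlap between its first turn and the current
    session's last turn (the real model uses dense contrastive scores)."""
    last = set(session[-1].split())
    scored = [(len(last & set(cand[0].split())), cand) for cand in corpus]
    scored.sort(key=lambda item: -item[0])
    return [cand for _, cand in scored[:top_k]]

def diversity_sample(candidates, seen):
    """Down-weight sessions already used, approximating the paper's
    penalty on repetitive or generic sessions."""
    weights = [0.1 if tuple(cand) in seen else 1.0 for cand in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

def rescale(session, corpus, rounds=2):
    """Grow a short-turn session into a longer one by repeated
    retrieve-sample-concatenate steps."""
    seen = set()
    for _ in range(rounds):
        candidates = retrieve(session, corpus)
        if not candidates:
            break
        nxt = diversity_sample(candidates, seen)
        seen.add(tuple(nxt))
        session = session + nxt
    return session
```

For example, `rescale(["hello", "I love tea"], corpus, rounds=2)` appends two retrieved sessions, turning a 2-turn dialogue into a 6-turn one when the corpus sessions have two turns each.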
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z) - Conversation Chronicles: Towards Diverse Temporal and Relational
Dynamics in Multi-Session Conversations [9.249662593315541]
We introduce a new 1M multi-session dialogue dataset, Conversation Chronicles, for implementing a long-term conversation setup.
We show that dialogue episodes in Conversation Chronicles reflect those properties while maintaining coherent and consistent interactions.
We also propose a dialogue model, called ReBot, which consists of chronological summarization and dialogue generation modules.
arXiv Detail & Related papers (2023-10-20T11:06:21Z) - DialoGPS: Dialogue Path Sampling in Continuous Semantic Space for Data
Augmentation in Multi-Turn Conversations [18.98951277038404]
In open-domain dialogue generation tasks, contexts and responses in most datasets are one-to-one mapped.
We propose DialoGue Path Sampling (DialoGPS) in continuous semantic space, the first many-to-many augmentation method for multi-turn dialogues.
arXiv Detail & Related papers (2023-06-29T08:12:47Z) - Pre-training Multi-party Dialogue Models with Latent Discourse Inference [85.9683181507206]
We pre-train a model that understands the discourse structure of multi-party dialogues, namely, to whom each utterance is replying.
To fully utilize the unlabeled data, we propose to treat the discourse structures as latent variables, then jointly infer them and pre-train the discourse-aware model.
arXiv Detail & Related papers (2023-05-24T14:06:27Z) - DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization [127.714919036388]
DIONYSUS is a pre-trained encoder-decoder model for summarizing dialogues in any new domain.
Our experiments show that DIONYSUS outperforms existing methods on six datasets.
arXiv Detail & Related papers (2022-12-20T06:21:21Z) - Controllable Dialogue Simulation with In-Context Learning [39.04491297557292]
Dialogic is a dialogue simulation method based on in-context learning with large language models.
Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement.
Our simulated dialogues have near-human fluency and annotation accuracy.
arXiv Detail & Related papers (2022-10-09T06:32:58Z) - Sparse and Dense Approaches for the Full-rank Retrieval of Responses for
Dialogues [11.726528038065764]
We focus on the more realistic task of full-rank retrieval of responses, where $n$ can be up to millions of responses.
Our findings based on three different information-seeking dialogue datasets reveal that a learned response expansion technique is a solid baseline for sparse retrieval.
We find the best performing method overall to be dense retrieval with intermediate training, followed by fine-tuning on the target conversational data.
arXiv Detail & Related papers (2022-04-22T08:15:15Z) - Dialogue Summaries as Dialogue States (DS2), Template-Guided
Summarization for Few-shot Dialogue State Tracking [16.07100713414678]
Few-shot dialogue state tracking (DST) is a realistic solution to this problem.
We propose to reformulate dialogue state tracking as a dialogue summarization problem.
arXiv Detail & Related papers (2022-03-03T07:54:09Z) - Rethinking Dialogue State Tracking with Reasoning [76.0991910623001]
This paper proposes to track dialogue states gradually with reasoning over dialogue turns with the help of the back-end data.
Empirical results demonstrate that our method significantly outperforms the state-of-the-art methods by 38.6% in terms of joint belief accuracy for MultiWOZ 2.1.
arXiv Detail & Related papers (2020-05-27T02:05:33Z) - Dialogue-Based Relation Extraction [53.2896545819799]
We present the first human-annotated dialogue-based relation extraction (RE) dataset DialogRE.
We argue that speaker-related information plays a critical role in the proposed task, based on an analysis of similarities and differences between dialogue-based and traditional RE tasks.
Experimental results demonstrate that a speaker-aware extension on the best-performing model leads to gains in both the standard and conversational evaluation settings.
arXiv Detail & Related papers (2020-04-17T03:51:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.