Controllable and Diverse Data Augmentation with Large Language Model for Low-Resource Open-Domain Dialogue Generation
- URL: http://arxiv.org/abs/2404.00361v1
- Date: Sat, 30 Mar 2024 13:28:51 GMT
- Title: Controllable and Diverse Data Augmentation with Large Language Model for Low-Resource Open-Domain Dialogue Generation
- Authors: Zhenhua Liu, Tong Zhu, Jianxiang Xiang, Wenliang Chen
- Abstract summary: We propose Summary-based Dialogue Augmentation with LLM (SDA).
Our approach enhances the controllability of LLM by using dialogue summaries as a planning tool.
Based on summaries, SDA can generate high-quality and diverse dialogue data even with a small seed dataset.
- Score: 6.685921135304385
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data augmentation (DA) is crucial to mitigate model training instability and over-fitting problems in low-resource open-domain dialogue generation. However, traditional DA methods often neglect semantic data diversity, restricting the overall quality. Recently, large language models (LLMs) have been used for DA to generate diversified dialogues. However, they have limited controllability and tend to generate dialogues with a distribution shift compared to the seed dialogues. To maximize the augmentation diversity and address the controllability problem, we propose Summary-based Dialogue Augmentation with LLM (SDA). Our approach enhances the controllability of LLM by using dialogue summaries as a planning tool. Based on summaries, SDA can generate high-quality and diverse dialogue data even with a small seed dataset. To evaluate the efficacy of data augmentation methods for open-domain dialogue, we designed a clustering-based metric to characterize the semantic diversity of the augmented dialogue data. The experimental results show that SDA can augment high-quality and semantically diverse dialogues given a small seed dataset and an LLM, and the augmented data can boost the performance of open-domain dialogue models.
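The clustering-based diversity metric mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration: real dialogue embeddings would come from a sentence encoder, and the greedy threshold clustering plus cluster-entropy score are stand-ins for whatever clustering algorithm and statistic the paper actually uses.

```python
# Minimal sketch of a clustering-based semantic-diversity metric.
# Toy 2-D vectors stand in for real dialogue embeddings.
import math

def greedy_cluster(vectors, threshold=0.5):
    """Assign each vector to the first cluster centroid within `threshold`
    (Euclidean distance), or open a new cluster."""
    centroids, labels = [], []
    for v in vectors:
        for i, c in enumerate(centroids):
            if math.dist(v, c) <= threshold:
                labels.append(i)
                break
        else:
            centroids.append(v)
            labels.append(len(centroids) - 1)
    return labels

def diversity_score(labels):
    """Entropy of the cluster-size distribution: higher = more diverse."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Two near-duplicate "dialogues" plus two on distinct topics.
embs = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (-3.0, 2.0)]
labels = greedy_cluster(embs)
print(labels, round(diversity_score(labels), 3))
```

A set of augmented dialogues that all collapse into one cluster would score 0, while dialogues spread evenly across many clusters would score near the maximum entropy, matching the intuition that augmentation should cover many distinct topics.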
Related papers
- DFlow: Diverse Dialogue Flow Simulation with Large Language Models [16.209331014315463]
This paper proposes a novel data augmentation method designed to enhance the diversity of synthetic dialogues.
We generate a task-oriented dialogue dataset comprising 3,886 dialogue flows across 15 different domains.
arXiv Detail & Related papers (2024-10-18T20:35:28Z) - DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications [18.378069426713]
Existing research is constrained by general or niche datasets that lack sufficient scale for training dialogue systems.
We introduce DiaSynth, a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues.
We perform our experiments by generating synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum.
arXiv Detail & Related papers (2024-09-25T07:03:31Z) - Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups [1.7725414095035827]
This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors.
We propose a novel data augmentation framework to enhance SDS performance for user groups with limited resources.
arXiv Detail & Related papers (2024-08-20T03:33:04Z) - Plan, Generate and Complicate: Improving Low-resource Dialogue State Tracking via Easy-to-Difficult Zero-shot Data Augmentation [5.042738414157664]
We propose EDZ-DA, an Easy-to-Difficult Zero-shot Data Augmentation framework for low-resource dialogue state tracking.
We also complicate the dialogues based on the domain relation to enhance the model's capability for co-reference slot tracking.
arXiv Detail & Related papers (2024-06-13T06:49:03Z) - Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts [10.829227084902428]
We investigate the feasibility and effectiveness of data generation with Large Language Models (LLMs) for source-grounded information-seeking dialogs.
We create MISeD (Meeting Information Seeking Dialogs), a dataset of information-seeking dialogs focused on meeting transcripts.
Finetuning on MISeD gives comparable response generation quality to finetuning on fully manual data, while improving attribution quality and reducing time and effort.
arXiv Detail & Related papers (2024-05-02T09:35:06Z) - Enhancing Task Bot Engagement with Synthesized Open-Domain Dialog [89.35658776144638]
It is essential to build a system that can handle both TOD and ODD and access different knowledge sources.
We propose a framework for automatically generating dialogues that combine knowledge-grounded ODDs and TODs in various settings.
We introduce a unified model PivotBot that is capable of appropriately adopting TOD and ODD modes and accessing different knowledge sources.
arXiv Detail & Related papers (2022-12-20T05:51:47Z) - Weakly Supervised Data Augmentation Through Prompting for Dialogue
Understanding [103.94325597273316]
We present a novel approach that iterates on augmentation quality by applying weakly-supervised filters.
We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue.
For DailyDialog specifically, using 10% of the ground truth data we outperform the current state-of-the-art model which uses 100% of the data.
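The weakly-supervised filtering idea summarized above can be sketched roughly as follows; the keyword scorer and threshold below are toy placeholders for a trained weak labeler, not the paper's actual filters.

```python
# Sketch of weakly-supervised filtering of augmented examples: keep only
# candidates a weak labeler scores confidently. The scorer is a toy
# keyword heuristic standing in for a real weak-supervision signal.

def weak_confidence(text: str) -> float:
    """Toy confidence score: fraction of emotion keywords present."""
    keywords = {"happy", "sad", "angry", "surprised"}
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords)

def filter_augmented(candidates, threshold=0.25):
    """Keep candidates whose weak-label confidence clears the threshold;
    in the full method this filtering would be applied iteratively as
    augmentation quality improves."""
    return [c for c in candidates if weak_confidence(c) >= threshold]

cands = ["I am so happy today", "random noise tokens", "sad and angry news"]
print(filter_augmented(cands))
```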
arXiv Detail & Related papers (2022-10-25T17:01:30Z) - A Mixture-of-Expert Approach to RL-based Dialogue Management [56.08449336469477]
We use reinforcement learning to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction.
Most existing RL approaches to DM train the agent at the word-level, and thus have to deal with a combinatorially complex action space even for a medium-size vocabulary.
We develop an RL-based DM using a novel mixture-of-expert language model (MoE-LM) that consists of (i) a LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a
arXiv Detail & Related papers (2022-05-31T19:00:41Z) - Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z) - Dialogue Distillation: Open-Domain Dialogue Augmentation Using Unpaired Data [61.71319905364992]
We propose a novel data augmentation method for training open-domain dialogue models by utilizing unpaired data.
A data-level distillation process is first proposed to construct augmented dialogues where both post and response are retrieved from the unpaired data.
A ranking module is employed to filter out low-quality dialogues.
A model-level distillation process is employed to distill a teacher model trained on high-quality paired data to augmented dialogue pairs.
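A toy sketch of the data-level distillation step described above: posts and responses retrieved from unpaired text are combined into candidate pairs, ranked, and only top-scoring pairs are kept. The word-overlap ranker is an illustrative stand-in for the paper's trained ranking module.

```python
# Sketch of data-level distillation: pair posts and responses drawn from
# unpaired text, score each pairing, and keep the best candidates.

def rank_pair(post: str, response: str) -> float:
    """Toy relevance score: Jaccard word overlap between post and response."""
    p, r = set(post.lower().split()), set(response.lower().split())
    return len(p & r) / max(len(p | r), 1)

def distill_pairs(posts, responses, keep=2):
    """Score all post/response combinations and keep the `keep` best,
    mimicking the ranking module that filters out low-quality dialogues."""
    scored = [(rank_pair(p, r), p, r) for p in posts for r in responses]
    scored.sort(reverse=True)
    return [(p, r) for _, p, r in scored[:keep]]

posts = ["do you like coffee", "the weather is nice"]
responses = ["i like coffee a lot", "nice weather indeed"]
print(distill_pairs(posts, responses))
```

The kept pairs would then serve as augmented training data, with the model-level distillation step applied on top.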
arXiv Detail & Related papers (2020-09-20T13:06:38Z) - Paraphrase Augmented Task-Oriented Dialog Generation [68.1790912977053]
We propose a paraphrase augmented response generation (PARG) framework that jointly trains a paraphrase model and a response generation model.
We also design a method to automatically construct paraphrase training data set based on dialog state and dialog act labels.
arXiv Detail & Related papers (2020-04-16T05:12:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.