DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications
- URL: http://arxiv.org/abs/2409.19020v2
- Date: Tue, 15 Oct 2024 12:55:27 GMT
- Title: DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications
- Authors: Sathya Krishnan Suresh, Wu Mengjun, Tushar Pranav, Eng Siong Chng
- Abstract summary: Existing research is constrained by general or niche datasets that lack sufficient scale for training dialogue systems.
We introduce DiaSynth - a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues.
We perform our experiments by generating synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum.
- Score: 18.378069426713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scarcity of domain-specific dialogue datasets limits the development of dialogue systems across applications. Existing research is constrained by general or niche datasets that lack sufficient scale for training dialogue systems. To address this gap, we introduce DiaSynth - a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues across a wide range of domains. Unlike existing frameworks, DiaSynth uses Large Language Models (LLMs) and Chain of Thought (CoT) reasoning to generate dynamic, domain-specific dialogues with simulated personas and diverse conversational features. We perform our experiments by generating synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum. The pretrained language models fine-tuned on the synthetic data outperform the base models by 16.47% on dialogue summarization, while the comparison between models fine-tuned on in-domain data and synthetic data shows that the synthetic data captures 90.48% of the performance distribution of the in-domain data on dialogue summarization. The quality of the generated data also increases as we increase the size of the LLM from 3B to 8B. These results validate DiaSynth's potential as a robust alternative to traditional data collection methods. We open-source the code and the generated data for future research.
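To make the framework description concrete, here is a minimal sketch of a DiaSynth-style two-stage pipeline: a Chain-of-Thought prompt that derives subtopics and personas, followed by dialogue generation conditioned on few-shot examples (e.g. sampled from DialogSum or SAMSum). The `llm` callable, function names, and prompt wording are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch of a DiaSynth-style pipeline; the `LLM` interface
# and prompt wording are assumptions, not the paper's actual code.
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out LLM client


def derive_personas(llm: LLM, topic: str, n: int = 2) -> str:
    """Stage 1: Chain-of-Thought prompt that reasons out subtopics
    and speaker personas for the target domain."""
    return llm(
        f"Think step by step about the topic '{topic}'.\n"
        "1. List realistic subtopics.\n"
        f"2. Invent {n} personas (name, role, speaking style) who would "
        "plausibly discuss one subtopic.\n"
        "Answer with the chosen subtopic and the personas."
    )


def generate_dialogue(llm: LLM, plan: str, few_shot: List[str]) -> str:
    """Stage 2: condition generation on the persona plan plus few-shot
    example dialogues, yielding a domain-specific synthetic dialogue."""
    examples = "\n\n".join(few_shot)
    return llm(
        f"Example dialogues:\n{examples}\n\n"
        f"Scenario and personas:\n{plan}\n\n"
        "Write a natural multi-turn dialogue between the personas."
    )
```

The resulting dialogues would then be paired with labels such as summaries and used to fine-tune a downstream model in the same way as human-collected data.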
Related papers
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis [80.34000499166648]
We propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues.
We apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow.
Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
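One plausible reading of the graph-based sampling step, sketched under the assumption that tool relevance is given as an adjacency list; this is an illustration of the idea, not ToolFlow's actual code.

```python
import random
from typing import Dict, List, Set


def sample_tool_combo(graph: Dict[str, List[str]], size: int) -> Set[str]:
    """Grow a set of mutually relevant tools by walking a
    tool-relevance graph, so sampled combinations stay coherent."""
    current = random.choice(list(graph))
    combo = {current}
    while len(combo) < size:
        neighbors = [t for t in graph[current] if t not in combo]
        if not neighbors:  # dead end: restart from the sampled set
            current = random.choice(list(combo))
            neighbors = [t for t in graph[current] if t not in combo]
            if not neighbors:
                break
        current = random.choice(neighbors)
        combo.add(current)
    return combo
```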
arXiv Detail & Related papers (2024-10-24T05:45:04Z)
- Scalable Frame-based Construction of Sociocultural NormBases for Socially-Aware Dialogues [66.69453609603875]
Sociocultural norms serve as guiding principles for personal conduct in social interactions.
We propose a scalable approach for constructing a Sociocultural Norm (SCN) Base using Large Language Models (LLMs)
We construct a comprehensive and publicly accessible Chinese Sociocultural NormBase.
arXiv Detail & Related papers (2024-10-04T00:08:46Z)
- A Framework for Synthetic Audio Conversations Generation using Large Language Models [0.0]
ConversaSynth is a framework designed to generate synthetic conversation audio using large language models (LLMs) with multiple persona settings.
The framework first creates diverse and coherent text-based dialogues across various topics, which are then converted into audio using text-to-speech (TTS) systems.
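A minimal sketch of the text-then-TTS pipeline the summary describes; the `Synthesize` interface and the per-persona voice mapping are assumptions standing in for a concrete TTS backend.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any TTS backend mapping (text, voice) -> PCM samples.
Synthesize = Callable[[str, str], List[float]]


def dialogue_to_audio(turns: List[Tuple[str, str]],
                      voices: Dict[str, str],
                      tts: Synthesize) -> List[float]:
    """Render a text dialogue [(speaker, utterance), ...] with one
    distinct TTS voice per persona, concatenating the spoken turns."""
    audio: List[float] = []
    for speaker, utterance in turns:
        audio.extend(tts(utterance, voices[speaker]))
    return audio
```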
arXiv Detail & Related papers (2024-09-02T05:09:46Z)
- Self-Directed Synthetic Dialogues and Revisions Technical Report [16.587350874099638]
We introduce Self-Directed Synthetic Dialogues (SDSD), an experimental dataset consisting of guided conversations of language models talking to themselves.
SDSD consists of multi-turn conversations generated with DBRX, Llama 2 70B, and Mistral Large, all instructed to follow a conversation plan generated prior to the conversation.
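The plan-then-self-talk loop might look like the following sketch, where a single model alternates roles conditioned on the pre-generated conversation plan; the prompt wording and the `LLM` callable are illustrative assumptions.

```python
from typing import Callable, List

LLM = Callable[[str], str]


def self_dialogue(llm: LLM, plan: str, n_turns: int = 6) -> List[str]:
    """One model plays both sides, each turn conditioned on the
    pre-generated conversation plan plus the history so far."""
    history: List[str] = []
    for i in range(n_turns):
        role = "User" if i % 2 == 0 else "Assistant"
        prompt = (
            f"Conversation plan:\n{plan}\n\n"
            "Dialogue so far:\n" + "\n".join(history) +
            f"\n\nWrite the next {role} turn."
        )
        history.append(f"{role}: {llm(prompt)}")
    return history
```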
arXiv Detail & Related papers (2024-07-25T22:42:36Z)
- Simulating Task-Oriented Dialogues with State Transition Graphs and Large Language Models [16.94819621353007]
SynTOD is a new synthetic data generation approach for developing end-to-end Task-Oriented Dialogue (TOD) systems.
It generates diverse, structured conversations through random walks and response simulation using large language models.
In our experiments, using graph-guided response simulations leads to significant improvements in intent classification, slot filling and response relevance.
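A toy version of the random-walk idea: walking a state transition graph yields a structured skeleton of dialogue acts, which an LLM would then verbalize into turns. The states and acts below are invented for illustration, not SynTOD's actual schema.

```python
import random
from typing import Dict, List, Tuple

# Toy transition graph: state -> list of (next_state, dialogue_act).
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "start":   [("search", "user_query")],
    "search":  [("results", "show_results"), ("clarify", "ask_slot")],
    "clarify": [("search", "user_query")],
    "results": [("end", "confirm")],
}


def random_walk(graph: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Walk the transition graph to produce a skeleton of dialogue
    acts; response simulation fills in the actual utterances."""
    state, acts = "start", []
    while state != "end":
        state, act = random.choice(graph[state])
        acts.append(act)
    return acts
```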
arXiv Detail & Related papers (2024-04-23T06:23:34Z)
- Controllable and Diverse Data Augmentation with Large Language Model for Low-Resource Open-Domain Dialogue Generation [6.685921135304385]
We propose Summary-based Dialogue Augmentation (SDA) with LLM.
Our approach enhances the controllability of LLM by using dialogue summaries as a planning tool.
Based on summaries, SDA can generate high-quality and diverse dialogue data even with a small seed dataset.
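A minimal sketch of summary-as-plan augmentation: every generated dialogue must realize the same summary, which is what makes the augmentation controllable while the surface form stays diverse. The prompt wording is an assumption, not SDA's actual template.

```python
from typing import Callable, List

LLM = Callable[[str], str]


def augment_from_summary(llm: LLM, summary: str, k: int = 3) -> List[str]:
    """Use a dialogue summary as a plan: each generated dialogue must
    match the summary's content, while wording and structure vary."""
    return [
        llm(
            f"Summary: {summary}\n"
            "Write a two-speaker dialogue whose content matches this "
            "summary exactly. Vary wording and turn structure."
        )
        for _ in range(k)
    ]
```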
arXiv Detail & Related papers (2024-03-30T13:28:51Z)
- Does Collaborative Human-LM Dialogue Generation Help Information Extraction from Human Dialogues? [55.28340832822234]
Problem-solving human dialogues in real applications can be much more complex than existing Wizard-of-Oz collections.
We introduce a human-in-the-loop dialogue generation framework capable of synthesizing realistic dialogues.
arXiv Detail & Related papers (2023-07-13T20:02:50Z)
- Controllable Dialogue Simulation with In-Context Learning [39.04491297557292]
Dialogic is a dialogue simulation method based on in-context learning with large language models.
Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement.
Our simulated dialogues have near-human fluency and annotation accuracy.
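A hedged sketch of the bootstrap loop such an in-context method implies: seed dialogues serve as demonstrations, and the model emits new annotated dialogues in the same format with no human in the loop. This is a plausible reading of the summary, not Dialogic's released code.

```python
import random
from typing import Callable, List

LLM = Callable[[str], str]


def expand_seed_set(llm: LLM, seed: List[str], target: int) -> List[str]:
    """Bootstrap a small annotated seed set: sample a few seed
    dialogues as in-context demonstrations and let the LLM emit new
    annotated dialogues in the same format."""
    data = list(seed)
    while len(data) < target:
        demos = "\n\n".join(random.sample(seed, k=min(3, len(seed))))
        data.append(llm(
            f"{demos}\n\n"
            "Generate one more dialogue with annotations in the same format."
        ))
    return data
```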
arXiv Detail & Related papers (2022-10-09T06:32:58Z)
- Dialogue Distillation: Open-Domain Dialogue Augmentation Using Unpaired Data [61.71319905364992]
We propose a novel data augmentation method for training open-domain dialogue models by utilizing unpaired data.
A data-level distillation process is first proposed to construct augmented dialogues where both post and response are retrieved from the unpaired data.
A ranking module is employed to filter out low-quality dialogues.
A model-level distillation process is employed to distill a teacher model trained on high-quality paired data to augmented dialogue pairs.
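The data-level distillation step could be sketched as retrieval followed by rank-based filtering; `retrieve` and `score` are stand-ins for the paper's retrieval module and ranking model, not its actual interfaces.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # e.g. a trained ranking model


def build_augmented_pairs(posts: List[str], responses: List[str],
                          retrieve: Callable[[str, List[str]], str],
                          score: Scorer,
                          threshold: float) -> List[Tuple[str, str]]:
    """Data-level distillation: pair each post with a retrieved
    response from unpaired data, then keep only the pairs the
    ranking module scores above a quality threshold."""
    pairs = [(p, retrieve(p, responses)) for p in posts]
    return [(p, r) for p, r in pairs if score(p, r) >= threshold]
```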
arXiv Detail & Related papers (2020-09-20T13:06:38Z)
- Paraphrase Augmented Task-Oriented Dialog Generation [68.1790912977053]
We propose a paraphrase augmented response generation (PARG) framework that jointly trains a paraphrase model and a response generation model.
We also design a method to automatically construct paraphrase training data set based on dialog state and dialog act labels.
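One plausible realization of the automatic paraphrase-pair construction: utterances sharing the same dialog state and act labels express the same intent, so pairs within a label group can serve as paraphrase training data. The grouping key is an assumption about the label format.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def paraphrase_pairs(utterances: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (text, label) utterances by their dialog-state/act label;
    pairs within a group form automatically constructed paraphrases."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for text, label in utterances:
        by_label[label].append(text)
    return [pair for group in by_label.values()
            for pair in combinations(group, 2)]
```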
arXiv Detail & Related papers (2020-04-16T05:12:36Z)
- Variational Hierarchical Dialog Autoencoder for Dialog State Tracking Data Augmentation [59.174903564894954]
In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs.
We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs.
Experiments on various dialog datasets show that our model improves the downstream dialog trackers' robustness via generative data augmentation.
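As a rough illustration of why a hierarchical latent variable model helps here: ancestral sampling through conversation- and turn-level latents yields fully annotated synthetic dialogs for tracker augmentation. The factorization below is a plausible sketch in the spirit of VHDA, not the paper's exact model.

```python
from typing import Callable, List, Tuple

Sample = Callable[..., object]  # stand-ins for learned conditionals


def generate_dialog(sample_conv: Sample, sample_turn: Sample,
                    decode: Sample, n_turns: int) -> List[Tuple[str, str]]:
    """Ancestral sampling through a hierarchical latent model: a
    conversation-level latent conditions turn-level latents, each of
    which decodes into a (state_label, utterance) pair."""
    z_conv = sample_conv()
    dialog, z_turn = [], None
    for _ in range(n_turns):
        z_turn = sample_turn(z_conv, z_turn)  # hierarchy + recurrence
        dialog.append(decode(z_turn))         # annotated synthetic turn
    return dialog
```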
arXiv Detail & Related papers (2020-01-23T15:34:56Z)