Diverse and Effective Synthetic Data Generation for Adaptable Zero-Shot Dialogue State Tracking
- URL: http://arxiv.org/abs/2405.12468v2
- Date: Thu, 13 Jun 2024 17:32:00 GMT
- Title: Diverse and Effective Synthetic Data Generation for Adaptable Zero-Shot Dialogue State Tracking
- Authors: James D. Finch, Jinho D. Choi,
- Abstract summary: We show substantial performance gains in zero-shot dialogue state tracking (DST) by enhancing training data diversity through synthetic data generation.
Existing DST datasets are severely limited in the number of application domains and slot types they cover due to the high costs of data collection.
This work addresses this challenge with a novel, fully automatic data generation approach that creates synthetic zero-shot DST datasets.
- Score: 12.116834890063146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We demonstrate substantial performance gains in zero-shot dialogue state tracking (DST) by enhancing training data diversity through synthetic data generation. Existing DST datasets are severely limited in the number of application domains and slot types they cover due to the high costs of data collection, restricting their adaptability to new domains. This work addresses this challenge with a novel, fully automatic data generation approach that creates synthetic zero-shot DST datasets. Distinguished from previous methods, our approach can generate dialogues across a massive range of application domains, complete with silver-standard dialogue state annotations and slot descriptions. This technique is used to create the D0T dataset for training zero-shot DST models, encompassing an unprecedented 1,000+ domains. Experiments on the MultiWOZ benchmark show that training models on diverse synthetic data improves Joint Goal Accuracy by 6.7%, achieving results competitive with models 13.5 times larger than ours.
Related papers
- REFINE on Scarce Data: Retrieval Enhancement through Fine-Tuning via Model Fusion of Embedding Models [14.023953508288628]
Retrieval augmented generation (RAG) pipelines are commonly used in tasks such as question-answering (QA)
We propose REFINE, a novel technique that generates synthetic data from available documents and then uses a model fusion approach to fine-tune embeddings.
arXiv Detail & Related papers (2024-10-16T08:43:39Z) - TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - UNO-DST: Leveraging Unlabelled Data in Zero-Shot Dialogue State Tracking [54.51316566989655]
Previous zero-shot dialogue state tracking (DST) methods only apply transfer learning, ignoring unlabelled data in the target domain.
We transform zero-shot DST into few-shot DST by utilising such unlabelled data via joint and self-training methods.
We demonstrate this method's effectiveness on general language models in zero-shot scenarios, improving average joint goal accuracy by 8% across all domains in MultiWOZ.
arXiv Detail & Related papers (2023-10-16T15:16:16Z) - Divide, Conquer, and Combine: Mixture of Semantic-Independent Experts
for Zero-Shot Dialogue State Tracking [83.40120598637665]
Zero-shot transfer learning for Dialogue State Tracking (DST) helps to handle a variety of task-oriented dialogue domains without the cost of collecting in-domain data.
Existing works mainly study common data- or model-level augmentation methods to enhance the generalization.
We present a simple and effective "divide, conquer and combine" solution, which explicitly disentangles the semantics of seen data.
arXiv Detail & Related papers (2023-06-01T08:21:20Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via
Post-Processing Techniques [7.967995669387532]
generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback [21.168991554983815]
We propose a progressive zero-shot dataset generation framework, ProGen, to guide the generation of new training data.
We show ProGen achieves on-par or superior performance with only 1% synthetic dataset size.
arXiv Detail & Related papers (2022-10-22T02:07:10Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Improving Zero and Few-Shot Abstractive Summarization with Intermediate
Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z) - Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain
Dialogue State Tracking [8.151397072537797]
We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning.
We improve the zero-shot learning state of the art on average across domains by 21%.
arXiv Detail & Related papers (2020-05-02T18:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.