SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
- URL: http://arxiv.org/abs/2405.10040v3
- Date: Wed, 13 Nov 2024 11:13:56 GMT
- Title: SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
- Authors: Abhishek Divekar, Greg Durrett
- Abstract summary: We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
- Score: 55.2480439325792
- License:
- Abstract: It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to 32-shot prompting and four prior approaches. We release our code to perform all steps at https://github.com/amazon-science/synthesizrr
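For intuition, here is a minimal sketch of the retrieve-then-refine idea described in the abstract: each synthetic example is seeded with a different retrieved passage, so content variety comes from a corpus rather than from the LLM's parametric knowledge. The toy corpus, the token-overlap retriever, and the `llm_generate` stub are illustrative placeholders only, not the authors' implementation (which is available at the linked repository).

```python
# Minimal sketch of retrieval-augmented dataset synthesis in the spirit of
# SynthesizRR. All names here are illustrative placeholders: a toy token-overlap
# retriever stands in for a real sparse/dense retriever, and `llm_generate`
# is a stub for a teacher-LLM call.
from collections import Counter

CORPUS = [
    "The senate passed the infrastructure bill after a lengthy debate.",
    "The striker scored twice as the home side won the derby 3-1.",
    "Shares of the chipmaker fell after weaker-than-expected guidance.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by token overlap with the query."""
    q_tokens = Counter(query.lower().split())
    def score(passage: str) -> int:
        return sum((Counter(passage.lower().split()) & q_tokens).values())
    return sorted(corpus, key=score, reverse=True)[:k]

def llm_generate(prompt: str) -> str:
    """Placeholder for a teacher-LLM call (plug in your own client here)."""
    raise NotImplementedError

def synthesize(label: str, label_query: str, n_per_passage: int = 1) -> list[dict]:
    """Seed the teacher LLM with a different retrieved passage for each example,
    so generated examples vary in content instead of repeating popular entities."""
    examples = []
    for passage in retrieve(label_query, CORPUS):
        prompt = (
            f"Passage:\n{passage}\n\n"
            f"Write a short news headline about this passage that a reader "
            f"would label as '{label}'. Respond with the headline only."
        )
        for _ in range(n_per_passage):
            examples.append({"text": llm_generate(prompt), "label": label})
    return examples

# Example usage (requires a real llm_generate):
# train_examples = synthesize("sports", "football match result", n_per_passage=2)
```

A student classifier would then be fine-tuned on the resulting labeled examples; the key design point, per the abstract, is that diversity comes from varying the retrieved content that seeds each generation.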
Related papers
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of models trained on the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- Training Language Models on Synthetic Edit Sequences Improves Code Synthesis [33.13471417703669]
Large language models (LLMs) autoregressively synthesize programs in a single pass.
We develop a synthetic data generation algorithm called LintSeq to generate high-quality code edit data.
We show that models fine-tuned on edit sequences produce more diverse programs than baseline synthesis approaches.
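As a rough illustration of the edit-sequence idea (not the LintSeq algorithm itself, which, per the summary above, uses a linter to produce high-quality edit data), the sketch below recasts a small program as a chain of unified diffs between hand-picked intermediate versions; the intermediate states here are assumed for illustration.

```python
# Rough illustration: represent a program as a sequence of edits (unified diffs)
# between successive versions. Hand-picked intermediate states stand in for the
# linter-derived states that LintSeq itself would use.
import difflib

states = [
    "",                                                     # empty file
    "def mean(xs):\n    return sum(xs) / len(xs)\n",        # first version
    "def mean(xs):\n    if not xs:\n        return 0.0\n    return sum(xs) / len(xs)\n",
]

def edit_sequence(versions: list[str]) -> list[str]:
    """Turn successive program versions into a chain of unified diffs."""
    edits = []
    for before, after in zip(versions, versions[1:]):
        diff = difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="before", tofile="after",
        )
        edits.append("".join(diff))
    return edits

for i, edit in enumerate(edit_sequence(states)):
    print(f"--- edit {i} ---\n{edit}")
```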
arXiv Detail & Related papers (2024-10-03T17:57:22Z)
- Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment [39.137060714048175]
We argue that enhancing diversity can improve the parallelizable yet isolated approach to synthesizing datasets.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
arXiv Detail & Related papers (2024-09-26T08:03:19Z)
- Guiding Enumerative Program Synthesis with Large Language Models [15.500250058226474]
In this paper, we evaluate the abilities of Large Language Models to solve formal synthesis benchmarks.
For cases in which one-shot synthesis fails, we propose a novel enumerative synthesis algorithm.
We find that GPT-3.5 as a stand-alone tool for formal synthesis is easily outperformed by state-of-the-art formal synthesis algorithms.
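To make the enumerative side concrete, below is a generic toy enumerator over a tiny arithmetic DSL that searches for an expression consistent with input-output examples. It is a baseline-style sketch with an assumed, made-up DSL, not the LLM-guided algorithm from the paper.

```python
# Toy enumerative program synthesis: enumerate expressions over a tiny DSL
# (variable x, small constants, +, *) until one matches all I/O examples.
# Purely illustrative; not the paper's LLM-guided enumeration.
import itertools

EXAMPLES = [(1, 3), (2, 5), (5, 11)]   # consistent with f(x) = 2*x + 1

def programs(depth: int):
    """Yield expression strings up to the given nesting depth."""
    leaves = ["x", "0", "1", "2", "3"]
    if depth == 0:
        yield from leaves
        return
    yield from programs(depth - 1)
    smaller = list(programs(depth - 1))
    for op in ("+", "*"):
        for a, b in itertools.product(smaller, repeat=2):
            yield f"({a} {op} {b})"

def satisfies(expr: str, examples) -> bool:
    return all(eval(expr, {"x": x}) == y for x, y in examples)

for expr in programs(2):
    if satisfies(expr, EXAMPLES):
        print("found:", expr)   # e.g. ((2 * x) + 1)
        break
```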
arXiv Detail & Related papers (2024-03-06T19:13:53Z)
- SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? [57.42016037768947]
We present SynthCLIP, a CLIP model trained on entirely synthetic text-image pairs.
We generate synthetic datasets of images and corresponding captions at scale, with no human intervention.
arXiv Detail & Related papers (2024-02-02T18:59:58Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
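For context, over-sampling methods in this family build on SMOTE-style interpolation: a synthetic minority-class point is placed on the segment between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that basic interpolation step, with assumed toy data; it is not the AutoSMOTE algorithm, which learns such sampling decisions jointly via hierarchical reinforcement learning.

```python
# SMOTE-style interpolation sketch: synthesize minority-class points between
# a sample and one of its k nearest minority neighbours. Toy data, pure Python;
# not the AutoSMOTE algorithm itself.
import random

def smote_like_samples(minority: list[list[float]], n_new: int,
                       k: int = 3, seed: int = 0) -> list[list[float]]:
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        lam = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nn)])
    return synthetic

minority_points = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.2], [0.3, 1.1]]
print(smote_like_samples(minority_points, n_new=3))
```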
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Too Big to Fail? Active Few-Shot Learning Guided Logic Synthesis [18.961915757370466]
We propose Bulls-Eye, which fine-tunes a pre-trained model on past synthesis data to accurately predict the quality of a synthesis recipe for an unseen netlist.
This approach achieves 2x-10x run-time improvement and better quality-of-result (QoR) than state-of-the-art machine learning approaches.
arXiv Detail & Related papers (2022-04-05T17:18:04Z)
- Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize the seen and unseen samples, where unseen classes are not observable during training.
We propose a novel Generation Shifts Mitigating Flow framework for learning unseen data synthesis efficiently and effectively.
Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.