SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
- URL: http://arxiv.org/abs/2405.10040v2
- Date: Mon, 8 Jul 2024 11:20:42 GMT
- Title: SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
- Authors: Abhishek Divekar, Greg Durrett
- Abstract summary: We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
- Score: 55.2480439325792
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to 32-shot prompting and four prior approaches. We release our extensive codebase at https://github.com/amazon-science/synthesizrr
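The abstract describes a two-step recipe: retrieve passages from a corpus, then have the teacher LLM refine each passage into a labeled example. The sketch below illustrates that loop under stated assumptions; `retrieve`, `llm_generate`, and the prompt wording are placeholder interfaces, not the API of the released codebase.
```python
# Illustrative sketch of retrieval-augmented dataset synthesis in the spirit of
# the abstract (retrieve varied passages, then have the teacher LLM refine each
# into a labeled example). Not the released SynthesizRR code: `retrieve`,
# `llm_generate`, and the prompt wording are placeholder interfaces.
from typing import Callable, Dict, List


def synthesize_dataset(
    label_instructions: Dict[str, str],         # label -> how to express that label
    retrieve: Callable[[str, int], List[str]],  # (query, k) -> top-k corpus passages
    llm_generate: Callable[[str], str],         # teacher LLM completion
    k: int = 100,
) -> List[Dict[str, str]]:
    dataset: List[Dict[str, str]] = []
    for label, instruction in label_instructions.items():
        # Retrieval step: seed the teacher with different content for every
        # example instead of relying only on its parametric knowledge.
        for passage in retrieve(label, k):
            # Refinement step: rewrite the retrieved passage into a new
            # example that expresses the target label.
            prompt = (
                f"Passage:\n{passage}\n\n"
                f"Write a new '{label}' example grounded in this passage. "
                f"{instruction}\nExample:"
            )
            dataset.append({"text": llm_generate(prompt).strip(), "label": label})
    return dataset
```
A student classifier would then be fine-tuned on the synthesized dataset, so the variety of the retrieved passages carries through to the distilled model.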
Related papers
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of those trained on real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning [23.63386159778117]
We design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust learning.
CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies.
We show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
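The summary only states that CtrlSynth exposes fine-grained, user-defined control policies; the snippet below is a purely hypothetical illustration of what such a policy could look like (the `ControlPolicy` fields and `compose_caption` helper are invented for exposition, not CtrlSynth's API).
```python
# Purely hypothetical illustration of a fine-grained control policy for
# image-text synthesis; fields and helper are invented for exposition.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ControlPolicy:
    drop_tags: List[str] = field(default_factory=list)  # visual concepts to remove
    add_tags: List[str] = field(default_factory=list)   # concepts to inject
    style: str = "photorealistic"                        # rendering style for the synthetic image


def compose_caption(image_tags: List[str], policy: ControlPolicy) -> str:
    """Apply the policy to an image's extracted tags and produce a caption that
    a text-to-image model could render into a new synthetic training pair."""
    tags = [t for t in image_tags if t not in policy.drop_tags]
    tags += [t for t in policy.add_tags if t not in tags]
    return f"A {policy.style} image of " + ", ".join(tags)


# Example: remove 'dog', add 'bicycle', keep everything else.
print(compose_caption(["park", "dog", "bench"],
                      ControlPolicy(drop_tags=["dog"], add_tags=["bicycle"])))
```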
arXiv Detail & Related papers (2024-10-15T18:06:41Z)
- Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment [39.137060714048175]
We argue that enhancing diversity can improve approaches that synthesize distilled data in a parallelizable yet isolated manner.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
arXiv Detail & Related papers (2024-09-26T08:03:19Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
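As a rough illustration of the iterative loop implied here (synthesize seed data, train the small model, collect its errors on a little real data, and ask the LLM to extrapolate new examples from them), consider the sketch below; all function interfaces are assumptions rather than the paper's released code.
```python
# Rough sketch of an iterative error-extrapolation synthesis loop in the
# spirit of S3; every interface here is an assumed placeholder.
from typing import Callable, Dict, List

Example = Dict[str, str]  # {"text": ..., "label": ...}


def synthesis_step_by_step(
    seed_synthesize: Callable[[], List[Example]],                     # initial LLM-synthesized seed set
    train_student: Callable[[List[Example]], Callable],               # fit the small model
    find_errors: Callable[[Callable, List[Example]], List[Example]],  # misclassified real examples
    extrapolate: Callable[[List[Example]], List[Example]],            # LLM writes new data resembling the errors
    real_validation: List[Example],
    rounds: int = 3,
) -> List[Example]:
    synthetic = seed_synthesize()
    for _ in range(rounds):
        student = train_student(synthetic)
        # Errors on a small pool of real data expose where the synthetic
        # distribution diverges from the real one ...
        errors = find_errors(student, real_validation)
        if not errors:
            break
        # ... and the LLM extrapolates new synthetic examples from them,
        # shrinking the gap on the next round.
        synthetic += extrapolate(errors)
    return synthetic
```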
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing interactions at the syntax-semantics interface.
Our results suggest that LMs may serve as useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models [9.808214545408541]
LinguisticLens is a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of text datasets.
It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
arXiv Detail & Related papers (2023-05-19T00:53:45Z)
- Do Multi-Document Summarization Models Synthesize? [24.170828395176727]
We run experiments over opinion and evidence synthesis datasets using a suite of summarization models.
We find that existing models partially perform synthesis, but imperfectly.
We propose a simple, general, effective method for improving model synthesis capabilities.
arXiv Detail & Related papers (2023-01-31T18:40:46Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
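To make "multi-granularity sparse attention" concrete, the toy mask below combines a local token window with global sentence-level positions; the window size and granularities are illustrative choices, not HETFORMER's exact configuration.
```python
# Toy multi-granularity sparse attention mask: local token windows plus
# global sentence-level nodes. Parameters are illustrative only.
from typing import List

import numpy as np


def sparse_attention_mask(num_tokens: int, sentence_ids: List[int], window: int = 4) -> np.ndarray:
    """Return a boolean mask where True means attention is allowed."""
    n = num_tokens
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Fine granularity: each token attends to a local window of tokens.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    # Coarse granularity: sentence-level "global" positions attend everywhere
    # and are attended to by every token, linking distant parts of a long text.
    for g in sentence_ids:
        mask[g, :] = True
        mask[:, g] = True
    return mask


# Example: 12 tokens, with positions 0 and 6 acting as sentence nodes.
m = sparse_attention_mask(12, sentence_ids=[0, 6])
print(m.sum(), "allowed attention pairs out of", m.size)
```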
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Mitigating Generation Shifts for Generalized Zero-Shot Learning [52.98182124310114]
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize both seen and unseen samples, where unseen classes are not observable during training.
We propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework for learning unseen data synthesis efficiently and effectively.
Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings.
arXiv Detail & Related papers (2021-07-07T11:43:59Z)