FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs
- URL: http://arxiv.org/abs/2507.15839v1
- Date: Mon, 21 Jul 2025 17:51:46 GMT
- Title: FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs
- Authors: Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro,
- Abstract summary: Synthetic data generation is an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity.<n>Existing approaches that directly use large language models to generate each record individually impose prohibitive time and cost burdens.<n>We propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script.
- Score: 3.703188184729035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing high-fidelity, domain-relevant samples across various fields. However, existing approaches that directly use LLMs to generate each record individually impose prohibitive time and cost burdens, particularly when large volumes of synthetic data are required. In this work, we propose a fast, cost-effective method for realistic tabular data synthesis that leverages LLMs to infer and encode each field's distribution into a reusable sampling script. By automatically classifying fields into numerical, categorical, or free-text types, the LLM generates distribution-based scripts that can efficiently produce diverse, realistic datasets at scale without continuous model inference. Experimental results show that our approach outperforms traditional direct methods in both diversity and data realism, substantially reducing the burden of high-volume synthetic data generation. We plan to apply this methodology to accelerate testing in production pipelines, thereby shortening development cycles and improving overall system efficiency. We believe our insights and lessons learned will aid researchers and practitioners seeking scalable, cost-effective solutions for synthetic data generation.
Related papers
- Large Language Models for Data Synthesis [17.333852085464176]
Large Language Models (LLMs) have potential as flexible, high-dimensional priors over real-world distributions.<n>We introduce LLM Synthor, a framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback.<n>By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data.
arXiv Detail & Related papers (2025-05-20T13:35:38Z) - Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models [0.5156484100374059]
This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to generate synthetic Requirements Engineering (RE) data.<n>Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources.<n>Our evaluation shows that combining synthetic and real data leads to substantial performance improvements.
arXiv Detail & Related papers (2025-05-06T07:57:16Z) - Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains.<n>We highlight key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement.<n>We discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z) - LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.<n>LLM-TabFlow is a novel approach that captures complex inter-column relationships and compress data, while using Score-based Diffusion to model the distribution of the compressed data in latent space.<n>Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
arXiv Detail & Related papers (2025-03-04T00:47:52Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification [7.357494019212501]
We propose efficient weighted-loss approaches to align synthetic data with real-world distribution.<n>We empirically assessed the effectiveness of our method on multiple text classification tasks.
arXiv Detail & Related papers (2024-10-28T20:53:49Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses only less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [51.20656279478878]
MATRIX is a multi-agent simulator that automatically generates diverse text-based scenarios.<n>We introduce MATRIX-Gen for controllable and highly realistic data synthesis.<n>On AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model.
arXiv Detail & Related papers (2024-10-18T08:01:39Z) - Efficacy of Synthetic Data as a Benchmark [3.2968976262860408]
We investigate the effectiveness of generating synthetic data through large language models (LLMs)
Our experiments show that while synthetic data can effectively capture performance of various methods for simpler tasks, it falls short for more complex tasks like named entity recognition.
We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
arXiv Detail & Related papers (2024-09-18T13:20:23Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.