Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis
- URL: http://arxiv.org/abs/2509.20768v1
- Date: Thu, 25 Sep 2025 05:48:48 GMT
- Title: Measuring LLM Sensitivity in Transformer-based Tabular Data Synthesis
- Authors: Maria F. Davila R, Azizjon Turaev, Wolfram Wingerath
- Abstract summary: Transformer-based models outperform other state-of-the-art models in terms of data quality. High computational costs make them sometimes unfeasible for end users with prosumer hardware. GReaT consistently achieves lower runtimes than REaLTabFormer, and only on the largest dataset do they have comparable runtimes.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Synthetic tabular data is used for privacy-preserving data sharing and data-driven model development. Its effectiveness, however, depends heavily on the Tabular Data Synthesis (TDS) tool used. Recent studies have shown that Transformer-based models outperform other state-of-the-art models, such as Generative Adversarial Networks (GANs) and Diffusion models, in terms of data quality. However, Transformer-based models also come with high computational costs, making them sometimes unfeasible for end users with prosumer hardware. This study presents a sensitivity assessment of how the choice of hyperparameters, such as the number of layers or the hidden dimension, affects the quality of the resultant synthetic data and the computational performance. It is performed across two tools, GReaT and REaLTabFormer, evaluating 10 model setups that vary in architecture type and depth. We assess sensitivity along three dimensions: runtime, machine learning (ML) utility, and similarity to the real data distributions. Experiments were conducted on four real-world datasets. Our findings reveal that runtime is proportional to the number of hyperparameters, with shallower configurations completing faster. GReaT consistently achieves lower runtimes than REaLTabFormer; only on the largest dataset do they have comparable runtimes. For small datasets, both tools achieve synthetic data with high utility and optimal similarity, but on larger datasets only REaLTabFormer sustains strong utility and similarity. As a result, REaLTabFormer with lightweight LLMs provides the best balance, since it preserves data quality while reducing computational requirements. Nonetheless, its runtime remains higher than that of GReaT and other TDS tools, suggesting that efficiency gains are possible, but only up to a certain level.
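As a reading aid, the sketch below shows how the three sensitivity dimensions could be measured for an arbitrary TDS tool. The `fit_and_sample` wrapper (e.g. around GReaT or REaLTabFormer) is a hypothetical placeholder, and the concrete metrics (train-on-synthetic/test-on-real accuracy, mean per-column Kolmogorov-Smirnov similarity) are common choices, not necessarily the paper's exact implementations.

```python
# Sketch: runtime, ML utility, and distributional similarity for a TDS tool.
# `fit_and_sample` is a hypothetical wrapper (e.g. around GReaT or
# REaLTabFormer); features are assumed numeric or already encoded.
import time
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_tds(fit_and_sample, real: pd.DataFrame, target: str) -> dict:
    train, test = train_test_split(real, test_size=0.3, random_state=0)

    # Dimension 1: runtime of fitting the generator and sampling one table.
    start = time.perf_counter()
    synthetic = fit_and_sample(train, n_samples=len(train))
    runtime = time.perf_counter() - start

    # Dimension 2: ML utility via train-on-synthetic, test-on-real (TSTR).
    clf = RandomForestClassifier(random_state=0)
    clf.fit(synthetic.drop(columns=[target]), synthetic[target])
    utility = accuracy_score(test[target], clf.predict(test.drop(columns=[target])))

    # Dimension 3: similarity as the mean per-column (1 - KS statistic),
    # over numeric columns; 1.0 means the marginals match exactly.
    num_cols = real.select_dtypes(include=np.number).columns
    similarity = float(np.mean(
        [1 - ks_2samp(train[c], synthetic[c]).statistic for c in num_cols]))

    return {"runtime_s": runtime, "utility": utility, "similarity": similarity}
```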
Related papers
- TABFAIRGDT: A Fast Fair Tabular Data Generator using Autoregressive Decision Trees
We introduce TABFAIRGDT, a novel method for generating fair synthetic data using autoregressive decision trees. We evaluate TABFAIRGDT on benchmark fairness datasets and demonstrate that it outperforms state-of-the-art (SOTA) deep generative models. Remarkably, TABFAIRGDT achieves a 72% average speedup over the fastest SOTA baseline across various dataset sizes.
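TABFAIRGDT's exact procedure, including its fairness intervention, is not spelled out in this summary; for intuition only, the sketch below shows a generic autoregressive decision-tree sampler, assuming numeric columns and scikit-learn. Each column is modeled by a tree conditioned on the columns before it, and sampling proceeds left to right by resampling observed values from the predicted leaf.

```python
# Generic autoregressive decision-tree synthesis (illustrative only; not
# TABFAIRGDT's algorithm, and without its fairness adjustment).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def fit_autoregressive_trees(df: pd.DataFrame, seed: int = 0):
    rng = np.random.default_rng(seed)
    cols = list(df.columns)
    trees = {}
    for j, col in enumerate(cols[1:], start=1):
        tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=seed)
        tree.fit(df[cols[:j]], df[col])
        # Remember which training rows fall in each leaf, so sampling draws
        # observed values rather than leaf means (preserves marginal shape).
        leaves = tree.apply(df[cols[:j]])
        trees[col] = (tree, {leaf: df[col].values[leaves == leaf]
                             for leaf in np.unique(leaves)})

    def sample(n: int) -> pd.DataFrame:
        # First column: draw from its empirical marginal; then fill the rest
        # column by column, conditioned on what has been generated so far.
        out = pd.DataFrame({cols[0]: rng.choice(df[cols[0]].values, size=n)})
        for j, col in enumerate(cols[1:], start=1):
            tree, leaf_values = trees[col]
            leaves = tree.apply(out[cols[:j]])
            out[col] = [rng.choice(leaf_values[leaf]) for leaf in leaves]
        return out

    return sample

# Usage: sample = fit_autoregressive_trees(real_df); synthetic = sample(1000)
```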
arXiv Detail & Related papers (2025-09-24T09:35:52Z)
- Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning. Fine-tuning smaller, task-specific models is a more efficient alternative, but typically relies on scarce, manually labeled data. We propose a novel pipeline that eliminates the need for human-labeled query-document pairs.
arXiv Detail & Related papers (2025-09-23T09:47:27Z)
- A Comparative Study of Open-Source Libraries for Synthetic Tabular Data Generation: SDV vs. SynthCity
Synthetic data generators provide a promising solution by replicating the statistical and structural properties of real data. This study evaluates the performance of six synthetic data generators from two widely used open-source libraries.
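For orientation, here is a minimal single-table example from the first of the two libraries, assuming the documented SDV 1.x API; the dataset path and synthesizer choice are illustrative, and SynthCity's API differs.

```python
# Minimal single-table SDV example (API as of SDV 1.x; illustrative only).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.read_csv("adult.csv")            # any tabular dataset (hypothetical path)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real)  # infer column types from the data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)
synthetic = synthesizer.sample(num_rows=len(real))
```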
arXiv Detail & Related papers (2025-06-21T22:45:40Z)
- Procedural Environment Generation for Tool-Use Agents
We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks.
arXiv Detail & Related papers (2025-05-21T14:10:06Z)
- Scaling Laws of Synthetic Data for Language Models
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- Generating Realistic Tabular Data with Large Language Models
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Little Giants: Synthesizing High-Quality Embedding Data at Scale
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Efficacy of Synthetic Data as a Benchmark
We investigate the effectiveness of generating synthetic data through large language models (LLMs).
Our experiments show that while synthetic data can effectively capture performance of various methods for simpler tasks, it falls short for more complex tasks like named entity recognition.
We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
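The summary does not give the bias factor's formula. Purely as a reading aid, one plausible way to operationalize the idea (comparing a model's score on a benchmark it generated itself versus benchmarks generated by other LLMs) might look like the following; all names and numbers are hypothetical, not the paper's definition.

```python
# Hypothetical "bias factor"-style measurement: ratio of a model's score on
# self-generated benchmark data to its score on data generated by other
# models. NOT the paper's formula, which is not stated in this summary.
from statistics import mean

def bias_factor(scores_on_own_data: list[float],
                scores_on_other_data: list[float]) -> float:
    """Values above 1.0 suggest the model is favored by its own synthetic benchmark."""
    return mean(scores_on_own_data) / mean(scores_on_other_data)

print(bias_factor([0.82, 0.80], [0.71, 0.69]))  # -> ~1.16
```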
arXiv Detail & Related papers (2024-09-18T13:20:23Z)
- Tabular Data Generation Models: An In-Depth Survey and Performance Benchmarks with Extensive Tuning
We investigate the impact of dataset-specific tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets. Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance compared to the original configurations.
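A sketch of what dataset-specific tuning typically means in practice, using Optuna's documented study API; the search space and the `train_and_score` objective are illustrative placeholders, not the benchmark's actual protocol.

```python
# Dataset-specific hyperparameter tuning with Optuna (search space and
# scoring function are placeholders, not the paper's protocol).
import optuna

def train_and_score(n_layers: int, hidden_dim: int, lr: float) -> float:
    # Hypothetical placeholder: fit a generator on THIS dataset and return a
    # quality score; a dummy value here so the sketch executes end to end.
    return 1.0 / (1.0 + abs(lr - 3e-4)) - 0.01 * n_layers

def objective(trial: optuna.Trial) -> float:
    return train_and_score(
        n_layers=trial.suggest_int("n_layers", 1, 6),
        hidden_dim=trial.suggest_categorical("hidden_dim", [128, 256, 512]),
        lr=trial.suggest_float("lr", 1e-5, 1e-3, log=True),
    )

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)   # re-run separately for each dataset
print(study.best_params)
```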
arXiv Detail & Related papers (2024-06-18T07:27:38Z)
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performance of a supervised LLM.
We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K, which significantly outperforms the supervised fine-tuning (SFT) accuracy of 35.9%.
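A minimal sketch of rejection-sampling data construction as the summary describes it: sample several candidate solutions per problem, keep only those whose final answer matches the ground truth, and fine-tune on the survivors. The `sample_solution` and `final_answer` helpers are hypothetical placeholders, not the paper's implementation.

```python
# Rejection sampling for fine-tuning data (illustrative; helper functions
# are hypothetical placeholders).
from typing import Callable

def build_rft_dataset(problems: list[dict],
                      sample_solution: Callable[[str], str],
                      final_answer: Callable[[str], str],
                      k: int = 8) -> list[dict]:
    dataset = []
    for prob in problems:
        kept: set[str] = set()
        for _ in range(k):  # draw k candidate solutions per problem
            solution = sample_solution(prob["question"])
            # Keep only solutions whose extracted answer matches the ground
            # truth, de-duplicated so the SFT set stays diverse.
            if final_answer(solution) == prob["answer"] and solution not in kept:
                kept.add(solution)
        dataset += [{"question": prob["question"], "solution": s} for s in kept]
    return dataset
```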
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
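The summary gives DGE's idea (an ensemble over generative-model parameters) but not its construction. As a rough analogue, one can train several generators with different seeds, generate a table from each, and report the spread of the downstream metric across members; in the sketch below a Gaussian mixture stands in for a deep generative model.

```python
# Rough ensemble-over-generators analogue of the DGE idea (a Gaussian
# mixture stands in for a deep generative model; not the paper's method).
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_synthesize(X: np.ndarray, n_members: int = 5, n_samples: int = 1000):
    tables = []
    for seed in range(n_members):
        gen = GaussianMixture(n_components=10, random_state=seed).fit(X)
        tables.append(gen.sample(n_samples)[0])  # one synthetic table per member
    return tables  # evaluate the downstream task on each; report mean and std

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
tables = ensemble_synthesize(X)
print(len(tables), tables[0].shape)              # 5 tables of shape (1000, 4)
```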
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.