StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes
- URL: http://arxiv.org/abs/2508.02601v1
- Date: Mon, 04 Aug 2025 16:55:02 GMT
- Title: StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes
- Authors: Siyi Liu, Yujia Zheng, Yongqi Zhang
- Abstract summary: StructSynth is a novel framework that integrates the generative power of Large Language Models with robust structural control. It produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of machine learning to tabular data in specialized domains is severely limited by data scarcity. While generative models offer a solution, traditional methods falter in low-data regimes, and recent Large Language Models (LLMs) often ignore the explicit dependency structure of tabular data, leading to low-fidelity synthetic data. To address these limitations, we introduce StructSynth, a novel framework that integrates the generative power of LLMs with robust structural control. StructSynth employs a two-stage architecture. First, it performs explicit structure discovery to learn a Directed Acyclic Graph (DAG) from the available data. Second, this learned structure serves as a high-fidelity blueprint to steer the LLM's generation process, forcing it to adhere to the learned feature dependencies and thereby ensuring the generated data respects the underlying structure by design. Our extensive experiments demonstrate that StructSynth produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
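To make the two-stage recipe concrete, here is a minimal, self-contained Python sketch (not the authors' code): structure discovery is reduced to a toy correlation-threshold heuristic, and `llm_generate_value` is a hypothetical stand-in for a prompt-steered LLM call that conditions each feature on its DAG parents.

```python
import random
from itertools import combinations

def discover_dag(rows, cols, threshold=0.5):
    """Toy structure discovery: add an edge i -> j (i earlier in column
    order) whenever |Pearson correlation| exceeds the threshold; a real
    system would run a proper causal-discovery algorithm here."""
    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0
    parents = {c: [] for c in cols}
    for i, j in combinations(range(len(cols)), 2):
        a = [r[cols[i]] for r in rows]
        b = [r[cols[j]] for r in rows]
        if abs(corr(a, b)) > threshold:
            parents[cols[j]].append(cols[i])   # orient along column order
    return parents

def llm_generate_value(col, parent_values, rows):
    """Hypothetical stand-in for a prompt-steered LLM call: the prompt would
    encode the parent assignments; here we copy from the closest real row."""
    if not parent_values:
        return random.choice(rows)[col]
    nearest = min(rows, key=lambda r: sum((r[p] - v) ** 2
                                          for p, v in parent_values.items()))
    return nearest[col]

def synthesize(rows, cols, n):
    parents = discover_dag(rows, cols)
    out = []
    for _ in range(n):
        rec = {}
        for c in cols:  # column order doubles as a topological order here
            rec[c] = llm_generate_value(c, {p: rec[p] for p in parents[c]}, rows)
        out.append(rec)
    return out

# tiny demo on correlated toy data
real = []
for _ in range(100):
    x = random.gauss(0, 1)
    real.append({"x": x, "y": 2 * x + random.gauss(0, 0.1)})
print(synthesize(real, ["x", "y"], 3))
```

What the sketch preserves from the paper's design is the ordering guarantee: each feature is generated only after its DAG parents, so the structure learned in stage one constrains stage two by construction.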
Related papers
- Dependency-aware synthetic tabular data generation
In particular, functional dependencies (FDs) and logical dependencies (LDs) are rarely, and often only poorly, retained in synthetic datasets. We propose the Hierarchical Feature Generation Framework (HFGF), which generates independent features and reconstructs dependent features based on FD and LD rules. Our experiments on four benchmark datasets demonstrate that HFGF improves the preservation of FDs and LDs across six generative models.
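Read as pseudocode, the HFGF recipe fits in a few lines; the columns and rules below are invented for illustration and are not from the paper.

```python
import random

# dependent column -> rule over already-generated values (hypothetical FDs)
fd_rules = {
    "tax_band": lambda r: "high" if r["income"] > 80_000 else "low",
    "net_income": lambda r: round(r["income"] * (0.6 if r["tax_band"] == "high" else 0.8), 2),
}

def generate_row():
    row = {"income": random.uniform(20_000, 150_000)}  # independent feature
    for col, rule in fd_rules.items():  # dependent features, in rule order
        row[col] = rule(row)
    return row

print([generate_row() for _ in range(3)])
```

Because dependent columns are reconstructed from the rules rather than sampled, the stated FDs hold in every synthetic row by construction.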
arXiv Detail & Related papers (2025-07-25T12:29:58Z)
- Generating Synthetic Relational Tabular Data via Structural Causal Models
We develop a novel framework that generates realistic synthetic relational data, including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies, mimicking real-world scenarios.
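A hedged sketch of what SCM-style relational generation can look like (the tables, structural equations, and noise scales here are assumptions for illustration): parent rows are sampled first, and each child row is produced by a structural equation over its parent's attributes.

```python
import random

def sample_customers(n):
    # exogenous parent table
    return [{"customer_id": i, "income": random.gauss(60_000, 15_000)}
            for i in range(n)]

def sample_orders(customers, mean_orders=2):
    orders, oid = [], 0
    for c in customers:
        for _ in range(random.randint(0, 2 * mean_orders)):
            # structural equation: spend depends causally on parent income
            spend = max(5.0, 0.001 * c["income"] + random.gauss(0, 10))
            orders.append({"order_id": oid,
                           "customer_id": c["customer_id"],  # foreign key
                           "amount": round(spend, 2)})
            oid += 1
    return orders

customers = sample_customers(5)
print(sample_orders(customers)[:3])
```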
arXiv Detail & Related papers (2025-07-04T12:27:23Z)
- Large Language Models are Good Relational Learners
We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for large language models (LLMs). Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to process and reason over complex entity relationships.
arXiv Detail & Related papers (2025-06-06T04:07:55Z)
- RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models
RelDiff is a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases.
arXiv Detail & Related papers (2025-05-31T21:01:02Z)
- Large Language Models for Data Synthesis
Large Language Models (LLMs) have potential as flexible, high-dimensional priors over real-world distributions. We introduce LLMSynthor, a framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. By minimizing discrepancies in the summary-statistics space, the iterative synthesis loop aligns real and synthetic data.
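The feedback loop is easy to caricature (illustrative only; a real system would ask the LLM for proposals, which is stubbed out here with a Gaussian draw).

```python
import random
import statistics

real = [random.gauss(50, 10) for _ in range(200)]

def discrepancy(synth):
    # distance between real and synthetic summary statistics
    return (abs(statistics.mean(synth) - statistics.mean(real))
            + abs(statistics.stdev(synth) - statistics.stdev(real)))

synth = [random.uniform(0, 100) for _ in range(50)]  # crude initial draft
current = discrepancy(synth)
for _ in range(500):
    i = random.randrange(len(synth))
    candidate = synth.copy()
    candidate[i] = random.gauss(50, 15)   # stand-in for an LLM proposal
    cand = discrepancy(candidate)
    if cand < current:                    # feedback: keep only improving edits
        synth, current = candidate, cand
print(round(current, 3))
```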
arXiv Detail & Related papers (2025-05-20T13:35:38Z)
- GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction
We propose GReaTER to generate realistic tabular data. GReaTER includes a data semantic enhancement system and a cross-table connecting method. Experimental results show that GReaTER outperforms the GReaT framework.
arXiv Detail & Related papers (2025-03-19T04:16:05Z)
- Structural and Statistical Texture Knowledge Distillation and Learning for Segmentation
We re-emphasize the low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. We propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, a Contourlet Decomposition Module (CDM) is introduced to decompose the low-level features, and a Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge.
arXiv Detail & Related papers (2025-03-11T04:49:25Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation. LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using score-based diffusion to model the distribution of the compressed data in latent space. Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
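The diffusion half of that pipeline can be caricatured in a few lines (a sketch only: a closed-form Gaussian score stands in for a learned score network, and the 1-D "latents" are fabricated rather than produced by an LLM-based compressor).

```python
import math
import random
import statistics

latents = [random.gauss(2.0, 0.5) for _ in range(500)]  # pretend-compressed rows
mu, sigma = statistics.mean(latents), statistics.stdev(latents)

def score(x):
    # d/dx log N(x; mu, sigma^2), the quantity a score network would learn
    return (mu - x) / sigma ** 2

def langevin_sample(steps=500, eps=0.01):
    x = random.uniform(-5, 5)
    for _ in range(steps):
        x += eps * score(x) + math.sqrt(2 * eps) * random.gauss(0, 1)
    return x

print([round(langevin_sample(), 2) for _ in range(5)])
```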
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- HyperG: Hypergraph-Enhanced LLMs for Structured Knowledge
HyperG is a hypergraph-based generation framework aimed at enhancing Large Language Models' ability to process structured knowledge. Specifically, HyperG first augments sparse data with contextual information and incorporates a prompt-attentive hypergraph learning network to encode both the augmented information and the intricate structural relationships within the data. To validate the effectiveness and generalization of HyperG, we conduct extensive experiments across two different downstream tasks requiring structured knowledge.
arXiv Detail & Related papers (2025-02-25T11:47:32Z)
- Learning to Model Graph Structural Information on MLPs via Graph Structure Self-Contrasting
We propose a Graph Structure Self-Contrasting (GSSC) framework that learns graph structural information without message passing.
The proposed framework is based purely on Multi-Layer Perceptrons (MLPs), where the structural information is only implicitly incorporated as prior knowledge.
It first applies structural sparsification to remove potentially uninformative or noisy edges in the neighborhood, and then performs structural self-contrasting in the sparsified neighborhood to learn robust node representations.
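A loose reading of that two-step recipe in code (my paraphrase, not the paper's implementation; a fixed random linear map stands in for the learned MLP).

```python
import math
import random

random.seed(0)
feats = {i: [random.gauss(0, 1) for _ in range(8)] for i in range(20)}
edges = {(i, j) for i in feats for j in feats if i < j and random.random() < 0.2}

def cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# step 1: structural sparsification keeps only edges whose endpoints look alike
kept = {(i, j) for i, j in edges if cos(feats[i], feats[j]) > 0.0}

# toy "MLP": a fixed random linear map standing in for learned layers
W = [[random.gauss(0, 0.5) for _ in range(8)] for _ in range(4)]
def mlp(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# step 2: self-contrasting: neighbors should embed closer than non-neighbors
def contrastive_loss(margin=0.5):
    loss = 0.0
    for i, j in kept:
        negatives = [n for n in feats
                     if n != i and (i, n) not in kept and (n, i) not in kept]
        k = random.choice(negatives)
        loss += max(0.0, margin
                    - cos(mlp(feats[i]), mlp(feats[j]))
                    + cos(mlp(feats[i]), mlp(feats[k])))
    return loss / max(1, len(kept))

print(round(contrastive_loss(), 3))
```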
arXiv Detail & Related papers (2024-09-09T12:56:02Z)
- StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
Large language models' (LLMs) ability to process structured data lags behind state-of-the-art (SoTA) models by an average of 35%.
We train a series of models, referred to as StructLM, based on the Mistral and CodeLlama model families, ranging from 7B to 34B parameters.
Our StructLM series surpasses task-specific models on 16 out of 18 evaluated datasets and establishes new SoTA performance on 8 SKG tasks.
arXiv Detail & Related papers (2024-02-26T15:47:01Z)
- RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design
This study aims to systematically construct a data-driven RNA design pipeline.
We crafted a benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure.
We incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process.
arXiv Detail & Related papers (2023-01-25T17:19:49Z)