A Technical Exploration of Causal Inference with Hybrid LLM Synthetic Data
- URL: http://arxiv.org/abs/2511.00318v1
- Date: Fri, 31 Oct 2025 23:34:44 GMT
- Title: A Technical Exploration of Causal Inference with Hybrid LLM Synthetic Data
- Authors: Dana Kim, Yichen Xu, Tiffany Lin
- Abstract summary: Large Language Models (LLMs) offer a flexible means to generate synthetic data. Existing approaches often fail to preserve key causal parameters such as the average treatment effect (ATE).
- Score: 3.121656940390038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) offer a flexible means to generate synthetic tabular data, yet existing approaches often fail to preserve key causal parameters such as the average treatment effect (ATE). In this technical exploration, we first demonstrate that state-of-the-art synthetic data generators, both GAN- and LLM-based, can achieve high predictive fidelity while substantially misestimating causal effects. To address this gap, we propose a hybrid generation framework that combines model-based covariate synthesis (monitored via distance-to-closest-record filtering) with separately learned propensity and outcome models, thereby ensuring that (W, A, Y) triplets retain their underlying causal structure. We further introduce a synthetic pairing strategy to mitigate positivity violations and a realistic evaluation protocol that leverages unlimited synthetic samples to benchmark traditional estimators (IPTW, AIPW, substitution) under complex covariate distributions. This work lays the groundwork for LLM-powered data pipelines that support robust causal analysis. Our code is available at https://github.com/Xyc-arch/llm-synthetic-for-causal-inference.git.
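As a minimal illustration of the three estimators named in the abstract (substitution, IPTW, AIPW), the sketch below simulates (W, A, Y) triplets from a known propensity model and a known outcome model, then estimates the ATE. It is not the authors' code. The data-generating process, the function names (`simulate`, `fit_logistic`, `estimate_ate`), the true ATE of 2.0, and the propensity clipping bounds are all assumptions chosen for illustration, and the nuisance models are plain logistic and linear regressions rather than the learned models the paper describes.

```python
import numpy as np


def simulate(n, rng):
    """Draw (W, A, Y) from hypothetical propensity and outcome models."""
    W = rng.normal(size=(n, 2))                                   # covariates
    p = 1.0 / (1.0 + np.exp(-(0.5 * W[:, 0] - 0.5 * W[:, 1])))    # P(A=1 | W)
    A = rng.binomial(1, p)                                        # treatment
    # Outcome model; the coefficient on A makes the true ATE equal to 2.0.
    Y = 2.0 * A + W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)
    return W, A, Y


def fit_logistic(X, a, iters=25):
    """Logistic regression via Newton-Raphson; X must include an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (a - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta


def estimate_ate(W, A, Y, clip=(0.01, 0.99)):
    """Return substitution, IPTW, and AIPW estimates of the ATE."""
    n = len(Y)
    X = np.column_stack([np.ones(n), W])

    # Propensity model g(W) = P(A=1 | W), clipped away from 0 and 1
    # (the clipping bounds are an illustrative assumption).
    g = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, A)))
    g = np.clip(g, *clip)

    # Outcome model Q(A, W) = E[Y | A, W], fit by ordinary least squares.
    theta, *_ = np.linalg.lstsq(np.column_stack([X, A]), Y, rcond=None)
    Q1 = np.column_stack([X, np.ones(n)]) @ theta    # predicted Y under A=1
    Q0 = np.column_stack([X, np.zeros(n)]) @ theta   # predicted Y under A=0
    QA = np.where(A == 1, Q1, Q0)

    weight = A / g - (1 - A) / (1 - g)
    return {
        "substitution": float(np.mean(Q1 - Q0)),
        "iptw": float(np.mean(weight * Y)),
        "aipw": float(np.mean(Q1 - Q0 + weight * (Y - QA))),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, A, Y = simulate(20000, rng)
    for name, value in estimate_ate(W, A, Y).items():
        print(f"{name}: {value:.3f}")
```

Because treatment and outcome are drawn from explicit models here, the true ATE is known by construction, so each estimator's error can be measured directly at any sample size. This appears to be the property that the abstract's evaluation protocol relies on when it benchmarks estimators on unlimited synthetic samples.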
Related papers
- Sharp Convergence Rates for Masked Diffusion Models [53.117058231393834]
We develop a total-variation based analysis for the Euler method that overcomes limitations. Our results relax assumptions on score estimation, improve parameter dependencies, and establish convergence guarantees. Overall, our analysis introduces a direct TV-based error decomposition along the CTMC trajectory and a decoupling-based path-wise analysis for FHS.
arXiv Detail & Related papers (2026-02-26T00:47:51Z)
- Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions [74.00222571094437]
Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. We make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance. We introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization.
arXiv Detail & Related papers (2026-01-01T06:11:16Z)
- MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data [2.1127261244588156]
We investigate the impact of synthetic data generation and fine-tuning techniques on the ability of large language models to recognize fallacious arguments. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines.
arXiv Detail & Related papers (2025-10-30T10:52:43Z)
- Beyond Real Data: Synthetic Data through the Lens of Regularization [9.459299281438074]
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. We present a learning-theoretic framework to quantify the trade-off between synthetic and real data.
arXiv Detail & Related papers (2025-10-09T11:33:09Z)
- Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z)
- LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models [20.767947974005168]
LLMSynthor is a macro-aware simulator that generates realistic micro-records consistent with target macro-statistics. It iteratively builds synthetic datasets to minimize discrepancies between synthetic and target aggregates. It achieves strong realism, statistical fidelity, and practical utility, making it broadly applicable to economics, social science, and urban studies.
arXiv Detail & Related papers (2025-05-20T13:35:38Z)
- Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models [0.5156484100374059]
This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to generate synthetic Requirements Engineering (RE) data. Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources. Our evaluation shows that combining synthetic and real data leads to substantial performance improvements.
arXiv Detail & Related papers (2025-05-06T07:57:16Z)
- LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion [49.898152180805454]
Synthetic datasets must maintain domain-specific logical consistency. Existing generative models often overlook these inter-column relationships. This study presents the first method to effectively preserve inter-column relationships without requiring domain knowledge.
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance [16.047084318753377]
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Recent advances have proposed leveraging the flexibility and generative capabilities of large language models to generate synthetic samples. This article develops novel theoretical foundations to systematically study the roles of synthetic samples in addressing imbalanced classification and spurious correlation.
arXiv Detail & Related papers (2024-06-05T21:24:26Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM) where we parameterize the joint distribution in terms of the derivatives of univariable log-conditionals (scores).
For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training.
We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
arXiv Detail & Related papers (2020-10-24T07:01:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.