DEREC-SIMPRO: unlock Language Model benefits to advance Synthesis in Data Clean Room
- URL: http://arxiv.org/abs/2411.00879v1
- Date: Thu, 31 Oct 2024 13:02:55 GMT
- Title: DEREC-SIMPRO: unlock Language Model benefits to advance Synthesis in Data Clean Room
- Authors: Tung Sum Thomas Kwok, Chi-hua Wang, Guang Cheng
- Abstract summary: We present the DEREC 3-step pre-processing pipeline to generalize adaptability of multi-table synthesizers.
We also introduce the SIMPRO 3-aspect evaluation metrics, which leverage conditional distribution and large-scale simultaneous hypothesis testing.
Results show that using DEREC improves fidelity, and multi-table synthesizers outperform single-table counterparts in collaboration settings.
- Score: 9.784347635082232
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data collaboration via Data Clean Room offers value but raises privacy concerns, which can be addressed through synthetic data and multi-table synthesizers. Common multi-table synthesizers fail to perform when subjects occur repeatedly in both tables. This is an urgent yet unresolved problem, since having both tables with repeating subjects is common. To improve performance in this scenario, we present the DEREC 3-step pre-processing pipeline to generalize adaptability of multi-table synthesizers. We also introduce the SIMPRO 3-aspect evaluation metrics, which leverage conditional distribution and large-scale simultaneous hypothesis testing to provide comprehensive feedback on synthetic data fidelity at both column and table levels. Results show that using DEREC improves fidelity, and multi-table synthesizers outperform single-table counterparts in collaboration settings. Together, the DEREC-SIMPRO pipeline offers a robust solution for generalizing data collaboration, promoting a more efficient, data-driven society.
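The SIMPRO side of the pipeline hinges on large-scale simultaneous hypothesis testing across many columns at once. Below is a minimal sketch of that idea, assuming per-column Kolmogorov-Smirnov tests and a Benjamini-Hochberg correction; these are illustrative choices, not necessarily the paper's exact tests.

```python
# Hedged sketch: column-level fidelity feedback via large-scale
# simultaneous hypothesis testing. Test choice (KS) and function names
# are assumptions, not the SIMPRO paper's exact procedure.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def column_fidelity_report(real: pd.DataFrame, synth: pd.DataFrame,
                           alpha: float = 0.05) -> pd.DataFrame:
    """Test each numeric column's real-vs-synthetic marginal, then apply
    a Benjamini-Hochberg correction across the whole family of tests."""
    cols = [c for c in real.columns if np.issubdtype(real[c].dtype, np.number)]
    pvals = np.array([ks_2samp(real[c], synth[c]).pvalue for c in cols])
    # Benjamini-Hochberg step-up: scale sorted p-values, enforce monotonicity.
    order = np.argsort(pvals)
    scaled = pvals[order] * len(pvals) / (np.arange(len(pvals)) + 1)
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty_like(monotone)
    adj[order] = np.clip(monotone, 0, 1)
    return pd.DataFrame({"column": cols, "p_value": pvals,
                         "p_adjusted": adj, "flagged": adj < alpha})
```

Correcting all p-values as one family is what keeps per-column feedback honest when tables have hundreds of columns; testing each column at a raw 0.05 threshold would flag many columns by chance alone.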
Related papers
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.
Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
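As a rough illustration of the extract-and-recombine step, the toy sketch below builds a concept-to-document map and pairs concepts that never co-occur, forcing cross-document recombination. The keyword-based `extract_concepts` stub stands in for what would be an LLM call in SynthLLM; every name here is hypothetical.

```python
# Hedged sketch of concept recombination over a document graph.
from itertools import combinations
from collections import defaultdict

def extract_concepts(doc: str) -> set[str]:
    # Stub: SynthLLM would prompt an LLM here; we fake it with long keywords.
    return {w for w in doc.lower().split() if len(w) > 7}

def recombine(docs: list[str]) -> list[tuple[str, str]]:
    concept_docs = defaultdict(set)
    for i, doc in enumerate(docs):
        for c in extract_concepts(doc):
            concept_docs[c].add(i)
    # Pair concepts with disjoint document sets: cross-document recombination.
    return [(a, b) for a, b in combinations(concept_docs, 2)
            if not concept_docs[a] & concept_docs[b]]  # each pair seeds a prompt
```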
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- LLM-TabFlow: Synthetic Tabular Data Generation with Inter-column Logical Relationship Preservation [49.898152180805454]
This study is the first to explicitly address inter-column relationship preservation in synthetic tabular data generation.
LLM-TabFlow is a novel approach that captures complex inter-column relationships and compresses data, while using Score-based Diffusion to model the distribution of the compressed data in latent space.
Our results show that LLM-TabFlow outperforms all baselines, fully preserving inter-column relationships while achieving the best balance between data fidelity, utility, and privacy.
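A minimal sketch of the property being measured here: how often synthetic rows violate inter-column logic. The two constraints below are hypothetical examples for illustration, not rules from the paper.

```python
# Hedged sketch: audit inter-column logical relationships in synthetic data.
import pandas as pd

# Hypothetical constraints; real ones come from the source schema/domain.
CONSTRAINTS = {
    "ship_after_order": lambda df: df["ship_date"] >= df["order_date"],
    "total_is_price_x_qty": lambda df: (df["total"] - df["price"] * df["qty"]).abs() < 1e-6,
}

def constraint_violation_rates(synth: pd.DataFrame) -> dict[str, float]:
    """Fraction of synthetic rows violating each logical relationship."""
    return {name: float((~rule(synth)).mean()) for name, rule in CONSTRAINTS.items()}
```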
arXiv Detail & Related papers (2025-03-04T00:47:52Z)
- TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces TD3, a novel Dataset Distillation method within a meta-learning framework.
TD3 distills a fully expressive synthetic sequence summary from original data.
An augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the outer loop.
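To make the Tucker step concrete, here is a hedged sketch using tensorly on a toy (user x item x time) interaction tensor; the tensor shape and ranks are illustrative assumptions, not TD3's configuration.

```python
# Hedged sketch: Tucker decomposition as a compact dataset summary.
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

interactions = tl.tensor(np.random.rand(100, 50, 20))  # toy user-item-time tensor
core, factors = tucker(interactions, rank=[8, 8, 4])   # small core + factor matrices
reconstruction = tl.tucker_to_tensor((core, factors))  # expand back when needed
print(core.shape, [f.shape for f in factors])
```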
arXiv Detail & Related papers (2025-02-05T03:13:25Z)
- SampleLLM: Optimizing Tabular Data Synthesis in Recommendations [46.689486044254544]
Tabular data synthesis is crucial in machine learning, yet existing general methods are highly data-dependent and often fall short in recommender systems.
This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data.
We propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based data synthesis for recommendation tasks.
arXiv Detail & Related papers (2025-01-27T15:12:27Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
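A hedged sketch of those two steps on toy feedback records follows; field names, score normalizations, and thresholds are all assumed for illustration, not the paper's format.

```python
# Hedged sketch: unify heterogeneous feedback, then extract a
# high-quality, diverse subset.
def unify(record: dict) -> dict:
    """Normalize ratings, comparisons, etc. into (prompt, response, score)."""
    if record["kind"] == "rating":       # e.g. 1-5 stars mapped to [0, 1]
        score = (record["stars"] - 1) / 4
        return {"prompt": record["prompt"], "response": record["response"], "score": score}
    if record["kind"] == "comparison":   # winner of an A/B pair gets 1.0
        return {"prompt": record["prompt"], "response": record["winner"], "score": 1.0}
    raise ValueError(f"unknown feedback kind: {record['kind']}")

def select_subset(records: list[dict], quality_floor: float = 0.8,
                  per_prompt: int = 1) -> list[dict]:
    """Quality filter, then cap examples per prompt to keep the set diverse."""
    seen: dict[str, int] = {}
    keep = []
    for r in sorted(records, key=lambda r: -r["score"]):
        if r["score"] >= quality_floor and seen.get(r["prompt"], 0) < per_prompt:
            keep.append(r)
            seen[r["prompt"]] = seen.get(r["prompt"], 0) + 1
    return keep
```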
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- CTSyn: A Foundational Model for Cross Tabular Data Generation [9.568990880984813]
Cross-Table Synthesizer (CTSyn) is a diffusion-based foundational model tailored for tabular data generation.
CTSyn significantly outperforms existing table synthesizers in utility and diversity.
It also uniquely enhances the performance of downstream machine learning beyond what is achievable with real data.
arXiv Detail & Related papers (2024-06-07T04:04:21Z)
- Adapting Differentially Private Synthetic Data to Relational Databases [9.532509662034062]
We introduce a first-of-its-kind algorithm that can be combined with any existing differentially private (DP) synthetic data generation mechanism.
Our algorithm iteratively refines the relationship between individual synthetic tables to minimize their approximation errors.
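One way to picture the refinement loop: repeatedly move a synthetic child row from the most over-represented parent to the most under-represented one until children-per-parent counts approach a target histogram. The sketch below is an illustrative stand-in for that iteration, not the paper's algorithm.

```python
# Hedged sketch: iteratively refine synthetic foreign-key links so the
# children-per-parent histogram approaches a target (illustrative only).
import numpy as np

def refine_foreign_keys(target_counts: np.ndarray, n_child_rows: int,
                        n_iters: int = 100, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n_parents = len(target_counts)
    fk = rng.integers(0, n_parents, size=n_child_rows)  # random initial links
    for _ in range(n_iters):
        counts = np.bincount(fk, minlength=n_parents)
        error = counts - target_counts * (n_child_rows / target_counts.sum())
        over, under = np.argmax(error), np.argmin(error)
        if error[over] <= 0:                            # histogram already matches
            break
        fk[rng.choice(np.flatnonzero(fk == over))] = under  # move one child row
    return fk
```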
arXiv Detail & Related papers (2024-05-29T00:25:07Z)
- A Comparison of SynDiffix Multi-table versus Single-table Synthetic Data [0.7252027234425334]
SynDiffix is a new open-source tool for structured data synthesis.
It has anonymization features that allow it to generate multiple synthetic tables while maintaining strong anonymity.
This paper compares SynDiffix with 15 other synthetic data techniques using the SDNIST analysis framework.
arXiv Detail & Related papers (2024-03-13T12:26:50Z)
- Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
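A compact sketch of one such iteration follows, assuming a logistic-regression "small model" and a hypothetical `synthesize_like` helper standing in for the LLM generation call.

```python
# Hedged sketch of one S3-style round: fit on synthetic data, find the
# small model's errors on a tiny real set, synthesize more like them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def s3_round(X_syn, y_syn, X_real, y_real, synthesize_like):
    model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    wrong = model.predict(X_real) != y_real             # the model's error slice
    X_new, y_new = synthesize_like(X_real[wrong], y_real[wrong])  # LLM call in S3
    return np.vstack([X_syn, X_new]), np.concatenate([y_syn, y_new])
```

Iterating this loop concentrates new synthetic examples exactly where the small model and the real distribution disagree, which is how the gap shrinks.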
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
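A minimal sketch of the ensemble idea: fit several generators on bootstrap resamples and pool their samples. Gaussian KDEs stand in for the deep generative models here, an illustrative simplification of DGE.

```python
# Hedged sketch: ensemble of generators fit on bootstrap resamples,
# approximating uncertainty over the generative process.
import numpy as np
from scipy.stats import gaussian_kde

def dge_sample(real: np.ndarray, n_models: int = 5, n_total: int = 1000,
               seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    parts = []
    for _ in range(n_models):
        boot = real[rng.integers(0, len(real), size=len(real))]  # bootstrap resample
        kde = gaussian_kde(boot.T)           # stands in for one deep generator
        parts.append(kde.resample(n_total // n_models).T)
    return np.vstack(parts)                  # pooled synthetic sample
```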
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and conducts schema-weighted column sampling.
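As an illustration of schema-weighted column sampling, the sketch below upweights columns by how many foreign-key links they participate in; the weighting rule and all names are assumptions, not the paper's exact scheme.

```python
# Hedged sketch: sample columns for synthetic queries with weights
# derived from the schema's foreign-key structure.
import random

def sample_columns(schema: dict[str, list[str]], fk_links: dict[str, int],
                   k: int = 2, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    cols = [f"{t}.{c}" for t, cs in schema.items() for c in cs]
    weights = [1 + fk_links.get(col, 0) for col in cols]  # schema-weighted
    return rng.choices(cols, weights=weights, k=k)

schema = {"orders": ["id", "user_id", "total"], "users": ["id", "name"]}
print(sample_columns(schema, fk_links={"orders.user_id": 3}))
```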
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- Permutation-Invariant Tabular Data Synthesis [14.55825097637513]
We show that changing the input column order worsens the statistical difference between real and synthetic data by up to 38.67%.
We propose AE-GAN, a synthesizer that uses an autoencoder network to represent the tabular data and GAN networks to synthesize the latent representation.
We evaluate the proposed solutions on five datasets in terms of the sensitivity to the column permutation, the quality of synthetic data, and the utility in downstream analyses.
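The sensitivity measurement itself is easy to sketch: train the same synthesizer on the original and on a column-permuted copy, undo the permutation, and compare the average per-column distance to the real marginals. `train_synthesizer` below is a stand-in for any tabular model, and Wasserstein distance is an assumed choice of metric.

```python
# Hedged sketch: measure how much column order alone changes fidelity.
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_gap(real: np.ndarray, synth: np.ndarray) -> float:
    return float(np.mean([wasserstein_distance(real[:, j], synth[:, j])
                          for j in range(real.shape[1])]))

def permutation_sensitivity(real: np.ndarray, train_synthesizer,
                            seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    perm = rng.permutation(real.shape[1])
    base = marginal_gap(real, train_synthesizer(real))
    permuted = train_synthesizer(real[:, perm])[:, np.argsort(perm)]  # undo shuffle
    return (marginal_gap(real, permuted) - base) / base  # relative worsening
```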
arXiv Detail & Related papers (2022-11-17T01:14:19Z)
- Contrastive Self-supervised Sequential Recommendation with Robust Augmentation [101.25762166231904]
Sequential Recommendation describes a set of techniques to model dynamic user behavior in order to predict future interactions in sequential user data.
Old and new issues remain, including data-sparsity and noisy data.
We propose Contrastive Self-Supervised Learning for Sequential Recommendation (CoSeRec).
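For intuition, here is a sketch of the standard crop/mask/reorder sequence augmentations such contrastive recommenders build on; the operator set and ratios are illustrative, not CoSeRec's exact robust-augmentation recipe.

```python
# Hedged sketch: common augmentations for item-interaction sequences.
import random

def crop(seq: list[int], ratio: float = 0.8, rng=random) -> list[int]:
    n = max(1, int(len(seq) * ratio))
    start = rng.randrange(len(seq) - n + 1)
    return seq[start:start + n]                 # keep a contiguous window

def mask(seq: list[int], ratio: float = 0.2, pad: int = 0, rng=random) -> list[int]:
    return [pad if rng.random() < ratio else item for item in seq]

def reorder(seq: list[int], ratio: float = 0.2, rng=random) -> list[int]:
    if len(seq) < 2:
        return list(seq)
    n = max(2, int(len(seq) * ratio))
    start = rng.randrange(len(seq) - n + 1)
    sub = seq[start:start + n]
    rng.shuffle(sub)                            # locally shuffle a sub-sequence
    return seq[:start] + sub + seq[start + n:]
```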
arXiv Detail & Related papers (2021-08-14T07:15:25Z)
- SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources [8.350531869939351]
We study a synthetic data generation task called downscaling.
We propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula).
We make four key contributions in this work.
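A generic Gaussian-copula sampler, sketched below, captures the core mechanism: rank-transform real columns to normal scores, fit a correlation matrix, sample correlated normals, and map back through empirical quantiles. This is a textbook copula sampler, not SYNC's full multi-stage downscaling procedure.

```python
# Hedged sketch: generic Gaussian-copula synthesis of tabular data.
import numpy as np
from scipy.stats import norm

def copula_synthesize(real: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    m, d = real.shape
    # Rank-transform each column to normal scores.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    z = norm.ppf(ranks / (m + 1))
    corr = np.corrcoef(z, rowvar=False)          # dependence structure only
    # Sample correlated normals and map back via empirical quantiles.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n)
    u_new = norm.cdf(z_new)
    return np.column_stack([np.quantile(real[:, j], u_new[:, j]) for j in range(d)])
```

The copula separates marginals from dependence, which is what lets SYNC-style methods impose a joint structure on data known only through aggregates.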
arXiv Detail & Related papers (2020-09-20T16:36:25Z)