Generative Modeling of Complex Data
- URL: http://arxiv.org/abs/2202.02145v1
- Date: Fri, 4 Feb 2022 14:17:26 GMT
- Title: Generative Modeling of Complex Data
- Authors: Luca Canale, Nicolas Grislain, Grégoire Lothe and Johan Leduc
- Abstract summary: This paper puts forward a generic framework to synthesize more complex data structures with composite and nested types.
The results on standard benchmark datasets show that this implementation consistently outperforms current state-of-the-art models.
- Score: 8.201100713224003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, several models have improved the capacity to generate
synthetic tabular datasets. However, such models focus on synthesizing simple
columnar tables and are not usable on real-life data with complex structures.
This paper puts forward a generic framework to synthesize more complex data
structures with composite and nested types. It then proposes one practical
implementation, built with causal transformers, for structs (mappings of types)
and lists (repeated instances of a type). The results on standard benchmark
datasets show that this implementation consistently outperforms current
state-of-the-art models both in terms of machine learning utility and
statistical similarity. Moreover, it shows very strong results on two complex
hierarchical datasets with multiple levels of nesting and sparse data that were
previously out of reach.
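To make the struct-and-list design concrete, here is a minimal sketch (not the authors' implementation; Atom, Struct, ListOf, and the structural tokens below are hypothetical names) of how composite and nested values can be flattened into one token sequence that a causal transformer can model autoregressively:

```python
# Minimal sketch, assuming hypothetical names: composite (struct) and
# nested (list) values flattened into a token stream for a causal LM.
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class Atom:                      # a primitive value, e.g. a category id
    token: int

@dataclass
class Struct:                    # composite type: a mapping of named fields
    fields: Dict[str, "Node"]

@dataclass
class ListOf:                    # nested type: repeated instances of a type
    items: List["Node"]

Node = Union[Atom, Struct, ListOf]

BOS, EOS, BOL, EOL = 1, 2, 3, 4  # hypothetical structural tokens

def encode(node: Node) -> List[int]:
    """Depth-first flattening: structs emit fields in a fixed order,
    lists are bracketed so the model can learn variable lengths."""
    if isinstance(node, Atom):
        return [node.token]
    if isinstance(node, Struct):
        out = [BOS]
        for name in sorted(node.fields):   # fixed field order
            out += encode(node.fields[name])
        return out + [EOS]
    out = [BOL]
    for item in node.items:                # repeated instances
        out += encode(item)
    return out + [EOL]

# Example: a record with a nested, variable-length list of transactions.
record = Struct({
    "age": Atom(37),
    "transactions": ListOf([
        Struct({"amount": Atom(120)}),
        Struct({"amount": Atom(55)}),
    ]),
})
print(encode(record))  # one token sequence a causal transformer can model
```

In this sketch a struct contributes its fields in a fixed order, while a list is bracketed by begin/end tokens so the model can learn variable repetition counts; that bracketing is what makes nested and sparse records tractable for an autoregressive model.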
Related papers
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pairwise column correlation estimation.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models [10.88959673845634]
TabEBM is a class-conditional generative method using Energy-Based Models (EBMs).
Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods.
arXiv Detail & Related papers (2024-09-24T14:25:59Z)
- UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting [98.12558945781693]
We propose UniTST, a transformer-based model with a unified attention mechanism on flattened patch tokens.
Although our proposed model employs a simple architecture, it offers compelling performance as shown in our experiments on several datasets for time series forecasting.
arXiv Detail & Related papers (2024-06-07T14:39:28Z)
- CTSyn: A Foundational Model for Cross Tabular Data Generation [9.568990880984813]
Cross-Table Synthesizer (CTSyn) is a diffusion-based foundational model tailored for tabular data generation.
CTSyn significantly outperforms existing table synthesizers in utility and diversity.
It also uniquely enhances the performance of downstream machine learning beyond what is achievable with real data.
arXiv Detail & Related papers (2024-06-07T04:04:21Z)
- ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models [65.82630283336051]
We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by the existing training schemes of diffusion generative models.
We present a simple fix to this problem by constructing stochastic processes that fully exploit the combinatorial structures, hence the name ComboStoc.
arXiv Detail & Related papers (2024-05-22T15:23:10Z)
- Data Factors for Better Compositional Generalization [60.698130703909804]
We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
arXiv Detail & Related papers (2023-11-08T01:27:34Z)
- AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing [12.06889830487286]
Diffusion models have become a main paradigm for synthetic data generation in modern machine learning.
In this paper, we leverage the power of diffusion models for generating synthetic tabular data.
The resulting synthetic tables show good statistical fidelity to the real data and perform well in downstream machine learning tasks.
arXiv Detail & Related papers (2023-10-24T03:15:19Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers [0.0]
We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a synthetic data generation model.
It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence model (a schematic sketch of this two-stage flow appears after this list).
Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model.
arXiv Detail & Related papers (2023-02-04T00:32:50Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- Tabular Transformers for Modeling Multivariate Time Series [30.717890753132824]
Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential.
Here we propose neural network models for tabular time series that can leverage their hierarchical structure.
We demonstrate our models on two datasets: a synthetic credit card transaction dataset, where the learned representations are used for fraud detection and synthetic data generation, and on a real pollution dataset, where the learned encodings are used to predict atmospheric pollutant concentrations.
arXiv Detail & Related papers (2020-11-03T16:58:08Z)
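As a schematic illustration of the two-stage scheme described in the REaLTabFormer entry above, the following sketch replaces the actual models with stub samplers (the functions below are hypothetical stand-ins, not the REaLTabFormer API): an autoregressive model first produces parent rows, then a conditional model emits a variable number of child rows per parent.

```python
# Minimal sketch of parent-then-child relational generation, with stub
# samplers standing in for the trained models (hypothetical names).
import random

random.seed(0)  # reproducible toy output

def sample_parent_row(vocab):
    # Stand-in for the autoregressive (GPT-2-style) parent-table model.
    return {"customer_id": random.randrange(10**6),
            "segment": random.choice(vocab["segment"])}

def sample_child_rows(parent, n_max=5):
    # Stand-in for the sequence-to-sequence child model, conditioned on
    # the parent row; the parent key ties the two tables together.
    n = random.randrange(n_max + 1)          # variable number of children
    return [{"customer_id": parent["customer_id"],
             "amount": round(random.lognormvariate(3, 1), 2)}
            for _ in range(n)]

vocab = {"segment": ["retail", "sme", "corporate"]}
parent_table = [sample_parent_row(vocab) for _ in range(3)]
child_table = [row for p in parent_table for row in sample_child_rows(p)]
print(parent_table)
print(child_table)
```

The essential structure is that each child row is generated conditioned on its parent row (represented here by carrying the parent key), which is how the relational dependency between the two tables is captured.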
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.