Related papers: MissDiff: Training Diffusion Models on Tabular Data with Missing Values

MissDiff: Training Diffusion Models on Tabular Data with Missing Values

URL: http://arxiv.org/abs/2307.00467v1
Date: Sun, 2 Jul 2023 03:49:47 GMT
Title: MissDiff: Training Diffusion Models on Tabular Data with Missing Values
Authors: Yidong Ouyang, Liyan Xie, Chongxuan Li, Guang Cheng
Abstract summary: This work presents a unified and principled diffusion-based framework for learning from data with missing values. We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective. We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
Score: 29.894691645801597
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The diffusion model has shown remarkable performance in modeling data distributions and synthesizing data. However, the vanilla diffusion model requires complete or fully observed data for training. Incomplete data is a common issue in various real-world applications, including healthcare and finance, particularly when dealing with tabular datasets. This work presents a unified and principled diffusion-based framework for learning from data with missing values under various missing mechanisms. We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective. Then we propose to mask the regression loss of Denoising Score Matching in the training phase. We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases. The proposed framework is evaluated on multiple tabular datasets using realistic and efficacious metrics and is demonstrated to outperform state-of-the-art diffusion model on tabular data with "impute-then-generate" pipeline by a large margin.

Related papers

MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation [2.124791625488617]
We present MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation.<n>While sampling enables diverse completions, it also introduces output variability that complicates downstream processing.
arXiv Detail & Related papers (2025-08-05T04:55:26Z)
TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data. TabDiff achieves superior average performance over existing competitive baselines, with up to $22.5%$ improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in a training dataset. We develop constrained diffusion models by imposing diffusion constraints based on desired distributions. We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
Self-Supervision Improves Diffusion Models for Tabular Data Imputation [20.871219616589986]
This paper introduces an advanced diffusion model named Self-supervised imputation Diffusion Model (SimpDM for brevity) To mitigate sensitivity to noise, we introduce a self-supervised alignment mechanism that aims to regularize the model, ensuring consistent and stable imputation predictions. We also introduce a carefully devised state-dependent data augmentation strategy within SimpDM, enhancing the robustness of the diffusion model when dealing with limited data.
arXiv Detail & Related papers (2024-07-25T13:06:30Z)
Amortizing intractable inference in diffusion models for vision, language, and control [89.65631572949702]
This paper studies amortized sampling of the posterior over data, $mathbfxsim prm post(mathbfx)propto p(mathbfx)r(mathbfx)$, in a model that consists of a diffusion generative model prior $p(mathbfx)$ and a black-box constraint or function $r(mathbfx)$. We prove the correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from
arXiv Detail & Related papers (2024-05-31T16:18:46Z)
DiffPuter: Empowering Diffusion Models for Missing Data Imputation [56.48119008663155]
This paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation.<n>Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum likelihood estimation of data density.<n>Our experiments show that DiffPuter achieves an average improvement of 6.94% in MAE and 4.78% in RMSE compared to the most competitive existing method.
arXiv Detail & Related papers (2024-05-31T08:35:56Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)
Diffusion models for missing value imputation in tabular data [10.599563005836066]
Missing value imputation in machine learning is the task of estimating the missing values in the dataset accurately using available information. We propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDI_T) To effectively handle categorical variables and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization.
arXiv Detail & Related papers (2022-10-31T08:13:26Z)
Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis. We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
Learn from Unpaired Data for Image Restoration: A Variational Bayes Approach [18.007258270845107]
We propose LUD-VAE, a deep generative method to learn the joint probability density function from data sampled from marginal distributions. We apply our method to real-world image denoising and super-resolution tasks and train the models using the synthetic data generated by the LUD-VAE.
arXiv Detail & Related papers (2022-04-21T13:27:17Z)
Training Deep Normalizing Flow Models in Highly Incomplete Data Scenarios with Prior Regularization [13.985534521589257]
We propose a novel framework to facilitate the learning of data distributions in high paucity scenarios. The proposed framework naturally stems from posing the process of learning from incomplete data as a joint optimization task.
arXiv Detail & Related papers (2021-04-03T20:57:57Z)
Graph Embedding with Data Uncertainty [113.39838145450007]
spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines. Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.