DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model
- URL: http://arxiv.org/abs/2403.13863v1
- Date: Wed, 20 Mar 2024 08:45:31 GMT
- Title: DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model
- Authors: Yizhu Wen, Kai Yi, Jing Ke, Yiqing Shen
- Abstract summary: We propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM) for tabular data imputation.
It produces credible imputations for missing entries without undermining the authenticity of the existing data.
It can be applied to various settings of Missing Completely At Random (MCAR) and Missing At Random (MAR).
- Score: 9.908561639396273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data plays a crucial role in various domains but often suffers from missing values, thereby curtailing its potential utility. Traditional imputation techniques frequently yield suboptimal results and impose substantial computational burdens, leading to inaccuracies in subsequent modeling tasks. To address these challenges, we propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM). Specifically, DiffImpute is trained on complete tabular datasets, ensuring that it can produce credible imputations for missing entries without undermining the authenticity of the existing data. Innovatively, it can be applied to various settings of Missing Completely At Random (MCAR) and Missing At Random (MAR). To effectively handle the tabular features in DDPM, we tailor four tabular denoising networks, spanning MLP, ResNet, Transformer, and U-Net. We also propose Harmonization to enhance coherence between observed and imputed data by infusing the data back and denoising them multiple times during the sampling stage. To enable efficient inference while maintaining imputation performance, we propose a refined non-Markovian sampling process that works along with Harmonization. Empirical evaluations on seven diverse datasets underscore the prowess of DiffImpute. Specifically, when paired with the Transformer as the denoising network, it consistently outperforms its competitors, boasting an average ranking of 1.7 and the lowest standard deviation. In contrast, the next best method lags with a ranking of 2.8 and a standard deviation of 0.9. The code is available at https://github.com/Dendiiiii/DiffImpute.
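Read literally, the Harmonization step resembles RePaint-style resampling: at each timestep the observed entries are diffused forward to the current noise level, merged with the sample, denoised, and re-noised several times. Below is a minimal sketch under that reading; the noise predictor eps_model(x_t, t), the linear beta schedule, and the resample count n_harmonize are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of DDPM imputation with Harmonization-style resampling.
# eps_model, the beta schedule, and n_harmonize are assumptions for
# illustration, not the authors' implementation.
import torch

def harmonized_impute(eps_model, x_obs, mask, T=1000, n_harmonize=5):
    """Impute missing entries of x_obs; mask is 1 where a value is observed."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(x_obs)  # missing part starts from pure noise
    for t in reversed(range(T)):
        for u in range(n_harmonize):
            # Diffuse the observed entries forward to noise level t and merge,
            # so known values stay coherent with the partially denoised sample.
            noise = torch.randn_like(x_obs)
            x_obs_t = alpha_bars[t].sqrt() * x_obs + (1 - alpha_bars[t]).sqrt() * noise
            x = mask * x_obs_t + (1 - mask) * x

            # One ancestral denoising step: x_t -> x_{t-1}.
            eps = eps_model(x, t)
            mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean

            # "Infuse the data back": re-noise to level t and denoise again,
            # except on the final harmonization pass.
            if u < n_harmonize - 1 and t > 0:
                x = (1 - betas[t]).sqrt() * x + betas[t].sqrt() * torch.randn_like(x)

    return mask * x_obs + (1 - mask) * x
```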
Related papers
- TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model.
Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data.
TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
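One way to picture a joint forward process for mixed-type rows is Gaussian noise on numerical columns and random-resampling corruption on categorical ones; the toy sketch below uses made-up schedules sigma(t) = kappa(t) = t, not TabDiff's learned, modality-specific schedules.

```python
# Illustrative toy of a joint forward corruption for mixed-type rows.
# The schedules are assumptions, not TabDiff's learned schedules.
import torch

def joint_forward(x_num, x_cat, n_classes, t):
    """Corrupt numerical floats x_num and integer codes x_cat at time t in [0, 1]."""
    sigma, kappa = t, t  # toy noise scale and corruption probability
    x_num_t = x_num + sigma * torch.randn_like(x_num)      # Gaussian branch
    resample = torch.rand(x_cat.shape) < kappa              # which codes to corrupt
    random_cls = torch.randint_like(x_cat, n_classes)       # uniform replacement
    x_cat_t = torch.where(resample, random_cls, x_cat)      # categorical branch
    return x_num_t, x_cat_t
```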
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to Imbalanced Data [9.969882349165745]
In the field of data mining and machine learning, commonly used classification models cannot learn effectively from imbalanced data.
Most of the classical oversampling methods are based on the SMOTE technique, which only focuses on the local information of the data.
We propose a novel oversampling method, SEMRes-DDPM.
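For context, classical SMOTE interpolates each synthetic point between a minority sample and one of its k nearest minority neighbors, which is why it sees only local structure; a minimal sketch (illustrative, unrelated to SEMRes-DDPM's generative approach):

```python
# Minimal SMOTE-style oversampling: new points lie on segments between a
# minority sample and one of its k nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority sample
        j = idx[i, rng.integers(1, k + 1)]  # one of its true neighbors
        lam = rng.random()                  # interpolation coefficient
        samples.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(samples)
```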
arXiv Detail & Related papers (2024-03-09T14:01:04Z)
- Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers [3.481985817302898]
A concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution.
Motivated by this, we study supervised denoising and noisy-input regression under distribution shift.
arXiv Detail & Related papers (2023-05-26T22:41:40Z)
- UDPM: Upsampling Diffusion Probabilistic Models [33.51145642279836]
Denoising Diffusion Probabilistic Models (DDPM) have recently gained significant attention.
DDPMs generate high-quality samples from complex data distributions by defining an inverse process.
Compared with generative adversarial networks (GANs), the latent space of diffusion models is less interpretable.
In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM).
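For reference, the standard DDPM being generalized pairs a fixed Gaussian forward process with a learned reverse ("inverse") process:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
```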
arXiv Detail & Related papers (2023-05-25T17:25:14Z)
- On Calibrating Diffusion Probabilistic Models [78.75538484265292]
Diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks.
We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased.
Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
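The score matching loss in question can be written in its standard denoising form (generic notation and weighting w(t); this paper's exact formulation is not assumed here):

```latex
\mathcal{L}_{\mathrm{DSM}}(\theta)
= \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\, w(t)\,
\big\| s_\theta(x_t, t) - \nabla_{x_t}\log q(x_t \mid x_0) \big\|_2^2 \Big],
\qquad
\nabla_{x_t}\log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}
```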
arXiv Detail & Related papers (2023-02-21T14:14:40Z)
- Confidence-based Reliable Learning under Dual Noises [46.45663546457154]
Deep neural networks (DNNs) have achieved remarkable success in a variety of computer vision tasks.
Yet, the data collected from the open world are unavoidably polluted by noise, which may significantly undermine the efficacy of the learned models.
Various attempts have been made to reliably train DNNs under data noise, but they separately account for either the noise existing in the labels or that existing in the images.
This work provides a first unified framework for reliable learning under joint (image, label) noise.
arXiv Detail & Related papers (2023-02-10T07:50:34Z)
- Robust Face Anti-Spoofing with Dual Probabilistic Modeling [49.14353429234298]
We propose a unified framework called Dual Probabilistic Modeling (DPM), with two dedicated modules: DPM-LQ (Label Quality aware learning) and DPM-DQ (Data Quality aware learning).
DPM-LQ is able to produce robust feature representations without overfitting to the distribution of noisy semantic labels.
DPM-DQ can eliminate data noise from 'False Reject' and 'False Accept' during inference by correcting the prediction confidence of noisy data based on its quality distribution.
arXiv Detail & Related papers (2022-04-27T03:44:18Z)
- Estimating High Order Gradients of the Data Distribution by Denoising [81.24581325617552]
The first-order derivative of a data density can be estimated efficiently by denoising score matching.
We propose a method to directly estimate high order derivatives (scores) of a data density from samples.
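The first-order estimator is Tweedie's formula; its second-order analogue takes the standard form below for a noisy observation x̃ = x + σε with ε ~ N(0, I) (stated for orientation, not quoted from the paper):

```latex
\nabla_{\tilde{x}} \log p_\sigma(\tilde{x})
  = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^2},
\qquad
\nabla_{\tilde{x}}^2 \log p_\sigma(\tilde{x})
  = \frac{\mathrm{Cov}[x \mid \tilde{x}] - \sigma^2 I}{\sigma^4}
```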
arXiv Detail & Related papers (2021-11-08T18:59:23Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
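A generic way to write such an instance-dependent noise model (not necessarily the paper's exact factorization) is a transition that generates the noisy label ȳ from the clean label y and the instance x:

```latex
p(\bar{y} \mid x) = \sum_{y} p(\bar{y} \mid y, x)\, p(y \mid x)
```

Here the transition p(ȳ | y, x) depends on the instance x, unlike class-conditional noise models that assume a fixed p(ȳ | y).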
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for metric-based few-shot approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample so as to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
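A sketch of the confidence-weighted prototype update this describes (an assumed reading of the blurb, with a hypothetical softmax confidence derived from prototype distances rather than the authors' meta-learned one):

```python
# Illustrative confidence-weighted transductive prototype refinement.
# The distance-softmax confidence is an assumption, not the meta-learned
# confidence of the paper.
import torch

def refine_prototypes(protos, queries, n_steps=3, tau=10.0):
    """protos: [C, D] class prototypes; queries: [Q, D] unlabeled embeddings."""
    for _ in range(n_steps):
        dists = torch.cdist(queries, protos)       # [Q, C] query-prototype distances
        conf = torch.softmax(-tau * dists, dim=1)  # per-query soft class assignment
        weighted = conf.t() @ queries              # [C, D] confidence-weighted sums
        mass = conf.sum(dim=0, keepdim=True).t()   # [C, 1] total weight per class
        protos = (protos + weighted) / (1.0 + mass)  # mix old prototype with queries
    return protos
```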
arXiv Detail & Related papers (2020-02-27T10:22:17Z)