MissHDD: Hybrid Deterministic Diffusion for Hetrogeneous Incomplete Data Imputation
- URL: http://arxiv.org/abs/2511.14543v1
- Date: Tue, 18 Nov 2025 14:44:49 GMT
- Title: MissHDD: Hybrid Deterministic Diffusion for Hetrogeneous Incomplete Data Imputation
- Authors: Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
- Abstract summary: We propose a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels. A continuous DDIM-based channel provides efficient and stable deterministic denoising for numerical variables. A discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, models categorical and discrete features without leaving their valid sample manifolds. The two channels are trained under a unified conditional imputation objective, enabling coherent reconstruction of mixed-type incomplete data.
- Score: 4.935498694293104
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incomplete data are common in real-world tabular applications, where numerical, categorical, and discrete attributes coexist within a single dataset. This heterogeneous structure presents significant challenges for existing diffusion-based imputation models, which typically assume a homogeneous feature space and rely on stochastic denoising trajectories. Such assumptions make it difficult to maintain conditional consistency, and they often lead to information collapse for categorical variables or instability when numerical variables require deterministic updates. These limitations indicate that a single diffusion process is insufficient for mixed-type tabular imputation. We propose a hybrid deterministic diffusion framework that separates heterogeneous features into two complementary generative channels. A continuous DDIM-based channel provides efficient and stable deterministic denoising for numerical variables, while a discrete latent-path diffusion channel, inspired by loopholing-based discrete diffusion, models categorical and discrete features without leaving their valid sample manifolds. The two channels are trained under a unified conditional imputation objective, enabling coherent reconstruction of mixed-type incomplete data. Extensive experiments on multiple real-world datasets show that the proposed framework achieves higher imputation accuracy, more stable sampling trajectories, and improved robustness across MCAR, MAR, and MNAR settings compared with existing diffusion-based and classical methods. These results demonstrate the importance of structure-aware diffusion processes for advancing deep learning approaches to incomplete tabular data.
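The deterministic denoising that the abstract attributes to the continuous channel can be illustrated with the standard DDIM update at zero noise (eta = 0). This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `ddim_step` and `impute_numeric`, the `eps_model` callable and its signature, and the noise schedule `abar` are all hypothetical and do not come from the paper. The paper's conditioning mechanism may differ; here the observed values are simply passed to the noise predictor and restored at the end.

```python
import numpy as np


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).

    abar_t and abar_prev are cumulative products of (1 - beta) at the
    current and the previous (less noisy) timestep.
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Move to the previous timestep without injecting fresh noise.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred


def impute_numeric(x_obs, mask, eps_model, abar, seed=0):
    """Fill missing numerical entries with a deterministic reverse pass.

    x_obs     : array of values; entries where mask is False are ignored.
    mask      : boolean array, True where a value is observed.
    eps_model : hypothetical callable (x_t, x_obs, mask, t) -> predicted noise.
    abar      : 1-D array of cumulative alpha products, each strictly in (0, 1).
    """
    rng = np.random.default_rng(seed)
    # The only randomness is the starting point; the trajectory is deterministic.
    x = rng.standard_normal(x_obs.shape)
    for t in range(len(abar) - 1, -1, -1):
        abar_prev = abar[t - 1] if t > 0 else 1.0
        eps = eps_model(x, x_obs, mask, t)
        x = ddim_step(x, eps, abar[t], abar_prev)
    # Observed entries are kept as given; only missing ones are imputed.
    return np.where(mask, x_obs, x)
```

Because no noise is added inside the loop, a fixed seed and a fixed `eps_model` always yield the same completion, which is the property the abstract contrasts with stochastic denoising trajectories. In practice `eps_model` would be a trained network; the sketch leaves it abstract.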
Related papers
- Authentic Discrete Diffusion Model [72.31371542619121]
The Authentic Discrete Diffusion (ADD) framework redefines prior pseudo-discrete approaches. ADD reformulates the diffusion input by directly using float-encoded one-hot class data. Experiments demonstrate that ADD achieves superior performance on classification tasks compared to the baseline.
arXiv Detail & Related papers (2025-10-01T15:51:10Z)
- MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation [2.124791625488617]
We present MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation. While sampling enables diverse completions, it also introduces output variability that complicates downstream processing.
arXiv Detail & Related papers (2025-08-05T04:55:26Z)
- Interleaved Gibbs Diffusion: Generating Discrete-Continuous Data with Implicit Constraints [30.624303845550575]
Interleaved Gibbs Diffusion (IGD) is a novel generative modeling framework for discrete-continuous data. IGD generalizes the discrete-time Gibbs-sampling-type Markov chain to the case of discrete-continuous generation. It achieves state-of-the-art results without relying on domain-specific inductive biases.
arXiv Detail & Related papers (2025-02-19T05:51:24Z)
- Continuous Diffusion Model for Language Modeling [64.7425225935854]
Existing continuous diffusion models for discrete data underperform compared to discrete methods. We propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. Our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models.
arXiv Detail & Related papers (2025-02-17T08:54:29Z)
- TabDiff: a Mixed-type Diffusion Model for Tabular Data Generation [91.50296404732902]
We introduce TabDiff, a joint diffusion framework that models all mixed-type distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data. TabDiff achieves superior average performance over existing competitive baselines, with up to 22.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.
arXiv Detail & Related papers (2024-10-27T22:58:47Z)
- Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation [6.9295879301090535]
We propose the Latent Space Score-Based Diffusion Model (LSSDM) for probabilistic time series imputation.
LSSDM achieves superior imputation performance while also providing a better explanation and uncertainty analysis of the imputation mechanism.
arXiv Detail & Related papers (2024-09-13T15:32:26Z)
- Constrained Diffusion Models via Dual Training [80.03953599062365]
Diffusion processes are prone to generating samples that reflect biases in a training dataset.
We develop constrained diffusion models by imposing diffusion constraints based on desired distributions.
We show that our constrained diffusion models generate new data from a mixture data distribution that achieves the optimal trade-off among objective and constraints.
arXiv Detail & Related papers (2024-08-27T14:25:42Z)
- DiffPuter: Empowering Diffusion Models for Missing Data Imputation [56.48119008663155]
This paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum likelihood estimation of data density. Our experiments show that DiffPuter achieves an average improvement of 6.94% in MAE and 4.78% in RMSE compared to the most competitive existing method.
arXiv Detail & Related papers (2024-05-31T08:35:56Z)
- Generative inpainting of incomplete Euclidean distance matrices of trajectories generated by a fractional Brownian motion [46.1232919707345]
Fractional Brownian motion (fBm) features both randomness and strong scale-free correlations.
Here we examine a zoo of diffusion-based inpainting methods on a specific dataset of corrupted images.
We find that the conditional diffusion generation readily reproduces the built-in correlations of fBm paths in different memory regimes.
arXiv Detail & Related papers (2024-04-10T14:22:16Z)
- Uncertainty-Based Extensible Codebook for Discrete Federated Learning in Heterogeneous Data Silos [11.443755718706562]
Federated learning, aimed at leveraging vast distributed datasets, confronts a crucial challenge: the heterogeneity of data across different silos. We propose an innovative yet straightforward iterative framework, termed Uncertainty-Based Extensible-Codebook Federated Learning (UEFL). This framework dynamically maps latent features to trainable discrete vectors, assesses the uncertainty, and specifically extends the discretization dictionary or codebook for silos exhibiting high uncertainty.
arXiv Detail & Related papers (2024-02-29T06:13:10Z)
- Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)
- Diffusion-GAN: Training GANs with Diffusion [135.24433011977874]
Generative adversarial networks (GANs) are challenging to train stably.
We propose Diffusion-GAN, a novel GAN framework that leverages a forward diffusion chain to generate instance noise.
We show that Diffusion-GAN can produce more realistic images with higher stability and data efficiency than state-of-the-art GANs.
arXiv Detail & Related papers (2022-06-05T20:45:01Z)