SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to
Imbalanced Data
- URL: http://arxiv.org/abs/2403.05918v2
- Date: Tue, 12 Mar 2024 02:45:48 GMT
- Title: SEMRes-DDPM: Residual Network Based Diffusion Modelling Applied to
Imbalanced Data
- Authors: Ming Zheng, Yang Yang, Zhi-Hang Zhao, Shan-Chao Gan, Yang Chen, Si-Kai
Ni and Yang Lu
- Abstract summary: In the field of data mining and machine learning, commonly used classification models cannot learn effectively from imbalanced data.
Most classical oversampling methods are based on the SMOTE technique, which considers only local information in the data.
We propose a novel oversampling method, SEMRes-DDPM.
- Score: 9.969882349165745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the field of data mining and machine learning, commonly used
classification models cannot learn effectively from imbalanced data. To
balance the data distribution before model training, oversampling methods are
often used to generate samples for the minority classes. Most classical
oversampling methods are based on the SMOTE technique, which considers only
local information in the data, so the generated samples may not be realistic
enough. Among current oversampling methods based on generative networks,
GAN-based methods can capture the true distribution of the data but suffer
from mode collapse and training instability; in oversampling methods based on
denoising diffusion probabilistic models, the U-Net used in the reverse
diffusion process is not well suited to tabular data, and although an MLP can
replace the U-Net, its simple structure removes noise poorly. To overcome
these problems, we propose a novel oversampling method, SEMRes-DDPM. In the
SEMRes-DDPM reverse diffusion process, a new neural network structure,
SEMST-ResNet, is used; it is suitable for tabular data, removes noise well,
and generates tabular data of higher quality. Experiments show that
SEMST-ResNet removes noise better than an MLP; that SEMRes-DDPM generates
data distributions closer to the real ones than TabDDPM and CWGAN-GP; and
that, on 20 real imbalanced tabular datasets with 9 classification models,
SEMRes-DDPM improves the quality of the generated tabular data on three
evaluation metrics (F1, G-mean, AUC) and achieves better classification
performance than other SOTA oversampling methods.
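To make the SMOTE critique above concrete, here is a minimal sketch of the classical SMOTE interpolation step; the function name and defaults are our own, not from the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, n_new, k=5, seed=None):
    """Minimal SMOTE sketch: each synthetic point is a random convex
    combination of a minority sample and one of its k nearest minority
    neighbors. Illustrative only."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # Drop column 0: every point is its own nearest neighbor.
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))   # pick a minority sample
        j = rng.choice(neighbors[i])        # pick one of its neighbors
        lam = rng.random()                  # interpolation factor in [0, 1]
        new_points.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.asarray(new_points)
```

Every synthetic sample lies on a segment between two existing minority samples, which is exactly the "local information only" limitation the abstract criticizes.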
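The abstract does not specify the SEMST-ResNet architecture, so the sketch below shows only the general pattern it builds on: a residual MLP noise predictor with a timestep embedding for tabular DDPMs. The layer sizes, embedding scheme, and all names here are our assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class ResidualDenoiseBlock(nn.Module):
    """One residual block; the skip connection helps preserve signal
    through depth, which is the motivation for ResNet-style denoisers."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection

class TabularDenoiser(nn.Module):
    """Predicts the noise eps that was added to a tabular sample x_t at step t."""
    def __init__(self, n_features, hidden=256, n_blocks=4, t_dim=64):
        super().__init__()
        self.t_embed = nn.Sequential(
            nn.Linear(1, t_dim), nn.SiLU(), nn.Linear(t_dim, t_dim)
        )
        self.proj_in = nn.Linear(n_features + t_dim, hidden)
        self.blocks = nn.Sequential(
            *[ResidualDenoiseBlock(hidden, hidden * 2) for _ in range(n_blocks)]
        )
        self.proj_out = nn.Linear(hidden, n_features)

    def forward(self, x_t, t):
        # t: (batch,) integer timesteps, embedded as a scaled float.
        temb = self.t_embed(t.float().unsqueeze(-1) / 1000.0)
        h = self.proj_in(torch.cat([x_t, temb], dim=-1))
        return self.proj_out(self.blocks(h))
```

Such a network is trained with the standard DDPM objective: sample a timestep t and noise eps, form x_t from a real row x_0, and minimize the MSE between eps and the network's prediction.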
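Of the three evaluation metrics listed, G-mean is the least standard; below is a small sketch of how it is commonly computed for binary imbalanced classification (the helper function and example labels are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on the positive/minority
    class) and specificity (recall on the negative/majority class)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)

# Hypothetical predictions for illustration:
y_true  = np.array([0, 0, 0, 0, 1, 1])
y_pred  = np.array([0, 0, 0, 1, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.9, 0.4])
print(f1_score(y_true, y_pred), g_mean(y_true, y_pred),
      roc_auc_score(y_true, y_score))
```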
Related papers
- DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model [9.908561639396273]
We propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM) for tabular data imputation.
It produces credible imputations for missing entries without undermining the authenticity of the existing data.
It can be applied to various settings of Missing Completely At Random (MCAR) and Missing At Random (MAR).
arXiv Detail & Related papers (2024-03-20T08:45:31Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the harmful effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Classification Diffusion Models: Revitalizing Density Ratio Estimation [21.264497139730473]
Classification diffusion models (CDMs) are a DRE-based generative method that adopts the formalism of denoising diffusion models.
Our method is the first DRE-based technique that can successfully generate images beyond the MNIST dataset.
arXiv Detail & Related papers (2024-02-15T16:49:42Z)
- UDPM: Upsampling Diffusion Probabilistic Models [33.51145642279836]
Denoising Diffusion Probabilistic Models (DDPM) have recently gained significant attention.
DDPMs generate high-quality samples from complex data distributions by defining an inverse process.
Compared with generative adversarial networks (GANs), the latent space of diffusion models is less interpretable.
In this work, we propose to generalize the denoising diffusion process into an Upsampling Diffusion Probabilistic Model (UDPM).
arXiv Detail & Related papers (2023-05-25T17:25:14Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- MAPS: A Noise-Robust Progressive Learning Approach for Source-Free Domain Adaptive Keypoint Detection [76.97324120775475]
Cross-domain keypoint detection methods always require accessing the source data during adaptation.
This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z)
- Denoising diffusion models for out-of-distribution detection [2.113925122479677]
We exploit the view of denoising diffusion probabilistic models (DDPMs) as denoising autoencoders.
We use DDPMs to reconstruct an input that has been noised to a range of noise levels, and use the resulting multi-dimensional reconstruction error to classify out-of-distribution inputs.
arXiv Detail & Related papers (2022-11-14T20:35:11Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data alone by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that deters Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch (a minimal sketch of this weighting appears after this entry).
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
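As referenced in the ABSGD entry above, here is a minimal sketch of per-sample importance weighting inside a mini-batch, assuming softmax-of-loss weights with a temperature; this is one plausible form of the idea, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def absgd_style_loss(logits, targets, lam=1.0):
    """Mini-batch loss with per-sample importance weights: harder
    samples (larger loss) receive larger weights via a softmax with
    temperature lam. Illustrative sketch only."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # detach() so the weights act as constants in the backward pass.
    weights = torch.softmax(per_sample.detach() / lam, dim=0)
    return (weights * per_sample).sum()
```

The weighted loss plugs into ordinary momentum SGD; a small lam concentrates the batch gradient on hard (often minority-class) samples.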