Imputation of Missing Data with Class Imbalance using Conditional
Generative Adversarial Networks
- URL: http://arxiv.org/abs/2012.00220v1
- Date: Tue, 1 Dec 2020 02:26:54 GMT
- Title: Imputation of Missing Data with Class Imbalance using Conditional
Generative Adversarial Networks
- Authors: Saqib Ejaz Awan, Mohammed Bennamoun, Ferdous Sohel, Frank M
Sanfilippo, Girish Dwivedi
- Abstract summary: We propose a new method for imputing missing data based on its class-specific characteristics.
Our Conditional Generative Adversarial Imputation Network (CGAIN) imputes the missing data using class-specific distributions.
We tested our approach on benchmark datasets and achieved superior performance compared with the state-of-the-art and popular imputation approaches.
- Score: 24.075691766743702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing data is a common problem faced with real-world datasets. Imputation
is a widely used technique to estimate the missing data. State-of-the-art
imputation approaches, such as Generative Adversarial Imputation Nets (GAIN),
model the distribution of observed data to approximate the missing values. Such
an approach usually models a single distribution for the entire dataset, which
overlooks the class-specific characteristics of the data. Class-specific
characteristics are especially useful when there is a class imbalance. We
propose a new method for imputing missing data based on its class-specific
characteristics by adapting the popular Conditional Generative Adversarial
Networks (CGAN). Our Conditional Generative Adversarial Imputation Network
(CGAIN) imputes the missing data using class-specific distributions, which can
produce the best estimates for the missing values. We tested our approach on
benchmark datasets and achieved superior performance compared with the
state-of-the-art and popular imputation approaches.
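The key idea above, imputing from class-specific distributions rather than one global distribution, can be illustrated with a much simpler class-conditional imputer. This is a minimal sketch of the underlying intuition only, not the paper's adversarial method: CGAIN learns full conditional distributions with a generator and discriminator, whereas the stand-in below merely substitutes per-class feature means. The function name and example data are my own.

```python
import numpy as np

def class_conditional_impute(X, y):
    """Fill missing entries (NaN) with the mean of that feature
    computed within the sample's own class, instead of a single
    global mean. Illustrates the class-specific idea behind CGAIN;
    the actual method models conditional distributions with a GAN."""
    X = X.astype(float).copy()
    for c in np.unique(y):
        rows = (y == c)
        for j in range(X.shape[1]):
            col = X[rows, j]               # feature j, class c only
            mask = np.isnan(col)
            if mask.any():
                col[mask] = np.nanmean(col)  # class-specific mean
                X[rows, j] = col
    return X

# Toy example: the missing value in class 0 is filled from class 0's
# statistics, not from the pooled (class-imbalanced) dataset.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 10.0],
              [7.0, 12.0]])
y = np.array([0, 0, 1, 1])
X_imp = class_conditional_impute(X, y)
```

With a global mean, the missing entry in row 0 would be pulled toward the majority class's values; conditioning on the class label keeps each estimate inside its own class's distribution, which is exactly the property that matters under class imbalance.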
Related papers
- Classification of datasets with imputed missing values: does imputation
quality matter? [2.7646249774183]
Classifying samples in incomplete datasets is non-trivial.
We demonstrate how the commonly used measures for assessing quality are flawed.
We propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data.
arXiv Detail & Related papers (2022-06-16T22:58:03Z)
- Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Graph-LDA: Graph Structure Priors to Improve the Accuracy in Few-Shot
Classification [6.037383467521294]
We introduce a generic model in which observed class signals are assumed to be corrupted by two sources of noise.
We derive an optimal methodology to classify such signals.
This methodology includes a single parameter, making it particularly suitable for cases where available data is scarce.
arXiv Detail & Related papers (2021-08-23T21:55:45Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC points and 12.5 average-precision points.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Evaluating State-of-the-Art Classification Models Against Bayes
Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
arXiv Detail & Related papers (2021-06-07T06:21:20Z)
- IFGAN: Missing Value Imputation using Feature-specific Generative
Adversarial Networks [14.714106979097222]
We propose IFGAN, a missing value imputation algorithm based on Feature-specific Generative Adversarial Networks (GANs).
A feature-specific generator is trained to impute missing values, while a discriminator is expected to distinguish the imputed values from observed ones.
We empirically show on several real-life datasets that IFGAN outperforms current state-of-the-art algorithms under various missing conditions.
arXiv Detail & Related papers (2020-12-23T10:14:35Z)
- PC-GAIN: Pseudo-label Conditional Generative Adversarial Imputation
Networks for Incomplete Data [19.952411963344556]
PC-GAIN is a novel unsupervised missing-data imputation method.
We first propose a pre-training procedure to learn potential category information contained in a subset of low-missing-rate data.
Then an auxiliary classifier is determined using the synthetic pseudo-labels.
arXiv Detail & Related papers (2020-11-16T08:08:26Z)
- Extended Missing Data Imputation via GANs for Ranking Applications [5.2710726359379265]
Conditional Imputation GAN is an extended missing data imputation method based on Generative Adversarial Networks (GANs).
We prove that the optimal GAN imputation is achieved for Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR) mechanisms, beyond the naive MCAR.
arXiv Detail & Related papers (2020-11-04T01:15:41Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled
Learning and Conditional Generation with Extra Data [77.31213472792088]
The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems.
We address this problem by leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data.
We present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data.
arXiv Detail & Related papers (2020-06-14T08:27:40Z)
- Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition
from a Domain Adaptation Perspective [98.70226503904402]
Object frequency in the real world often follows a power law, leading to a mismatch between the long-tailed class distributions seen during training and the expectation that models perform well on all classes.
We propose to augment the classic class-balanced learning by explicitly estimating the differences between the class-conditioned distributions with a meta-learning approach.
arXiv Detail & Related papers (2020-03-24T11:28:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.