Missing Data Imputation for Classification Problems
- URL: http://arxiv.org/abs/2002.10709v1
- Date: Tue, 25 Feb 2020 07:48:45 GMT
- Title: Missing Data Imputation for Classification Problems
- Authors: Arkopal Choudhury and Michael R. Kosorok
- Abstract summary: Imputation of missing data is a common application in various classification problems where the feature training matrix has missingness.
In this paper, we propose a novel iterative kNN imputation technique based on class weighted grey distance.
This ensures that the imputation of the training data is directed towards improving classification performance.
- Score: 1.52292571922932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imputation of missing data is a common application in various classification
problems where the feature training matrix has missingness. A widely used
solution to this imputation problem is based on the lazy learning technique,
$k$-nearest neighbor (kNN) approach. However, most of the previous work on
missing data does not take into account the presence of the class label in the
classification problem. Also, existing kNN imputation methods use variants of
Minkowski distance as a measure of distance, which does not work well with
heterogeneous data. In this paper, we propose a novel iterative kNN imputation
technique based on class weighted grey distance between the missing datum and
all the training data. Grey distance works well in heterogeneous data with
missing instances. The distance is weighted by Mutual Information (MI) which is
a measure of feature relevance between the features and the class label. This
ensures that the imputation of the training data is directed towards improving
classification performance. This class weighted grey kNN imputation algorithm
demonstrates improved performance when compared to other kNN imputation
algorithms, as well as standard imputation algorithms such as MICE and
missForest, in imputation and classification problems. These problems are based
on simulated scenarios and UCI datasets with various rates of missingness.
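The core idea, stripped of the paper's iterative refinement, can be sketched as follows. This is a minimal, non-iterative illustration, not the authors' implementation: the grey relational coefficient uses the common resolution ratio rho = 0.5, and the per-feature weights (which the paper derives from mutual information between each feature and the class label, e.g. via scikit-learn's `mutual_info_classif`) are passed in as a plain array:

```python
import numpy as np

def grey_relational_distance(x_ref, X, weights, rho=0.5):
    """Weighted grey relational distance between an incomplete reference
    row and a matrix of complete rows; NaN features are simply skipped."""
    observed = ~np.isnan(x_ref)
    diffs = np.abs(X[:, observed] - x_ref[observed])      # |x0j - xij|
    d_min, d_max = diffs.min(), diffs.max()
    # Grey relational coefficient per feature: larger when values are closer.
    grc = (d_min + rho * d_max) / (diffs + rho * d_max)
    w = weights[observed] / weights[observed].sum()
    grade = grc @ w                                       # weighted grey relational grade
    return 1.0 - grade                                    # higher grade => smaller distance

def grey_knn_impute(X, weights, k=3):
    """Fill each missing entry with the mean of that feature over the
    k nearest complete rows under the weighted grey distance."""
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in np.where(~complete)[0]:
        d = grey_relational_distance(X[i], X[complete], weights)
        nn = X[complete][np.argsort(d)[:k]]
        missing = np.isnan(X[i])
        X[i, missing] = nn[:, missing].mean(axis=0)
    return X
```

A faithful implementation would iterate until the imputed values stabilize and would handle heterogeneous (mixed categorical/numeric) features, as the paper does; this sketch only shows the distance-and-fill step for numeric data.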
Related papers
- On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets [0.0]
Missing values are a common characteristic of real-world datasets, especially healthcare data.
This study compares the performance of seven imputation techniques: Mean imputation, Median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE).
The results show that MissForest imputation performs best, followed by MICE imputation.
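Several of these techniques have stock implementations in scikit-learn, so a comparison of this kind can be sketched in a few lines. The synthetic data, 20% MCAR missingness, and RMSE metric below are illustrative assumptions, not the study's protocol; `IterativeImputer` is scikit-learn's MICE-style imputer, and missForest has no stock scikit-learn equivalent:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
X = X_true.copy()
mask = rng.random(X.shape) < 0.2        # ~20% missingness, missing completely at random
X[mask] = np.nan

imputers = {
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn":    KNNImputer(n_neighbors=5),
    "mice":   IterativeImputer(max_iter=10, random_state=0),  # chained equations
}
rmse = {}
for name, imp in imputers.items():
    X_hat = imp.fit_transform(X)
    # RMSE on the entries that were actually masked out.
    rmse[name] = float(np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2)))
    print(f"{name}: RMSE = {rmse[name]:.3f}")
```

On uncorrelated synthetic features like these the methods score similarly; the differences the study reports emerge on real healthcare data with correlated features.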
arXiv Detail & Related papers (2024-03-13T18:07:17Z)
- Continual learning for surface defect segmentation by subnetwork creation and selection [55.2480439325792]
We introduce a new continual (or lifelong) learning algorithm that performs segmentation tasks without undergoing catastrophic forgetting.
The method is applied to two different surface defect segmentation problems that are learned incrementally.
Our approach shows comparable results with joint training when all the training data (all defects) are seen simultaneously.
arXiv Detail & Related papers (2023-12-08T15:28:50Z)
- IRTCI: Item Response Theory for Categorical Imputation [5.9952530228468754]
Several imputation techniques have been designed to replace missing data with stand-in values.
The work showcased here offers a novel means of categorical imputation based on item response theory (IRT).
Analyses comparing these techniques were performed on three different datasets.
arXiv Detail & Related papers (2023-02-08T16:17:20Z)
- Large-Margin Representation Learning for Texture Classification [67.94823375350433]
This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification.
The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.
arXiv Detail & Related papers (2022-06-17T04:07:45Z)
- Principal Component Analysis based frameworks for efficient missing data imputation algorithms [3.635056427544418]
We propose Principal Component Analysis Imputation (PCAI) to speed up the imputation process and alleviate memory issues of many available imputation techniques.
Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI for classification problems with some adjustments.
We validate our approach with experiments on various scenarios, which show that PCAI and PIC can work with various imputation algorithms.
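A rough sketch of the PCAI idea, under the assumption that PCA is fitted on the fully observed feature block and imputation then runs on the reduced representation concatenated with the incomplete block; the function name and kNN back-end here are illustrative choices, not the paper's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer

def pcai_impute(X, n_components=2, k=5):
    """Assumed PCAI sketch: compress the fully observed columns with PCA,
    impute on [principal components | incomplete columns], and copy the
    imputed values back. Requires at least n_components complete columns."""
    has_nan = np.isnan(X).any(axis=0)
    X_obs, X_miss = X[:, ~has_nan], X[:, has_nan]
    Z = PCA(n_components=n_components).fit_transform(X_obs)
    filled = KNNImputer(n_neighbors=k).fit_transform(np.hstack([Z, X_miss]))
    X_out = X.copy()
    X_out[:, has_nan] = filled[:, n_components:]   # imputed incomplete block
    return X_out
```

The speed and memory gains come from running the (often expensive) imputer on the low-dimensional representation rather than the full feature matrix.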
arXiv Detail & Related papers (2022-05-30T14:47:27Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on observed data alone and maintain high accuracy even as the missing-data ratio increases.
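The first strategy can be sketched in a few lines; this is a simplified illustration, not the authors' pipeline, and the epoch layout and kNN back-end are assumptions:

```python
import numpy as np
from sklearn.impute import KNNImputer

def covariance_from_imputed(epoch, k=3):
    """epoch: (n_samples, n_channels) EEG segment with NaNs at missing values.
    kNN-impute the missing entries, then estimate the spatial covariance
    matrix that Riemannian (SPD-manifold) classifiers operate on."""
    filled = KNNImputer(n_neighbors=k).fit_transform(epoch)
    return np.cov(filled, rowvar=False)
```

The second strategy skips imputation entirely, maximizing the observed-data likelihood inside an EM loop, which the paper reports is the more robust of the two at high missingness.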
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC points and 12.5 average-precision points.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- KNN Classification with One-step Computation [10.381276986079865]
A one-step computation is proposed to replace the lazy part of KNN classification.
The proposed approach is evaluated experimentally, demonstrating that one-step KNN classification is efficient and promising.
arXiv Detail & Related papers (2020-12-09T13:34:42Z)
- Robustness to Missing Features using Hierarchical Clustering with Split Neural Networks [39.29536042476913]
We propose a simple yet effective approach that clusters similar input features together using hierarchical clustering.
We evaluate this approach on a series of benchmark datasets and show promising improvements even with simple imputation techniques.
arXiv Detail & Related papers (2020-11-19T00:35:08Z)
- Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View [82.80085730891126]
We provide the first modern, precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.