Chains of Autoreplicative Random Forests for missing value imputation in
high-dimensional datasets
- URL: http://arxiv.org/abs/2301.00595v1
- Date: Mon, 2 Jan 2023 10:53:52 GMT
- Title: Chains of Autoreplicative Random Forests for missing value imputation in
high-dimensional datasets
- Authors: Ekaterina Antonenko and Jesse Read
- Abstract summary: Missing values are a common problem in data science and machine learning.
We consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests.
Our algorithm effectively imputes missing values based only on information from the dataset.
- Score: 1.5076964620370268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing values are a common problem in data science and machine learning.
Removing instances with missing values can adversely affect the quality of
further data analysis. This is exacerbated when there are relatively many more
features than instances, and thus the proportion of affected instances is high.
Such a scenario is common in many important domains, for example, single
nucleotide polymorphism (SNP) datasets provide a large number of features over
a genome for a relatively small number of individuals. To preserve as much
information as possible prior to modeling, a rigorous imputation scheme is
acutely needed. While Denoising Autoencoders are a state-of-the-art method for
imputation in high-dimensional data, they still require enough complete cases
for training, which are often not available in real-world problems. In this
paper, we consider missing value imputation as a multi-label classification
problem and propose Chains of Autoreplicative Random Forests. Using multi-label
Random Forests instead of neural networks works well for low-sampled data as
there are fewer parameters to optimize. Experiments on several SNP datasets
show that our algorithm effectively imputes missing values based only on
information from the dataset and exhibits better performance than standard
algorithms that do not require any additional information. In this paper, the
algorithm is implemented specifically for SNP data, but it can easily be
adapted for other cases of missing value imputation.
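The chaining idea described in the abstract can be sketched as follows. This is a simplified, single-chain illustration using scikit-learn's RandomForestClassifier on a toy SNP-like matrix; the paper's exact chain ordering, ensembling of chains, and multi-label forest structure are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy SNP-like matrix: genotypes coded 0/1/2, with ~10% entries missing (NaN).
X = rng.integers(0, 3, size=(60, 8)).astype(float)
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

def chain_impute(X, n_passes=2, seed=0):
    """One left-to-right chain: each feature is re-predicted ('autoreplicated')
    from all the others, and imputed values feed forward into later models."""
    X = X.copy()
    miss = np.isnan(X)
    # Initial fill with the per-column mode so every model sees complete inputs.
    for j in range(X.shape[1]):
        vals, counts = np.unique(X[~miss[:, j], j], return_counts=True)
        X[miss[:, j], j] = vals[np.argmax(counts)]
    for _ in range(n_passes):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)   # all features except the target
            clf = RandomForestClassifier(n_estimators=50, random_state=seed)
            clf.fit(others[~miss[:, j]], X[~miss[:, j], j])
            X[miss[:, j], j] = clf.predict(others[miss[:, j]])
    return X

X_imputed = chain_impute(X)
```

Because each forest is trained only on the dataset itself, no complete reference panel is needed, which is the property the abstract emphasizes over autoencoder-based imputation.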
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable via our training procedure, including gradient descent and regularizers, which limits this flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMUs)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
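The BMU-based label propagation summarized above can be sketched as follows. This is a minimal from-scratch 1-D SOM on toy data, written for illustration only; the paper's topological-projection details, grid topology, and unit-selection strategy are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two unlabeled 2-D blobs; only one labeled sample per class is available.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
labeled_idx = np.array([0, 50])
labels = np.array([0, 1])

# Train a tiny 1-D SOM (4 units) on the unlabeled data.
n_units = 4
W = rng.normal(1.5, 1.0, (n_units, 2))
for t in range(300):
    x = X[rng.integers(len(X))]
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))            # best matching unit
    lr = 0.5 * (1.0 - t / 300)                             # decaying learning rate
    h = np.exp(-((np.arange(n_units) - bmu) ** 2) / 2.0)   # neighbourhood kernel
    W += lr * h[:, None] * (x - W)

# Label each unit with the class of its nearest labeled sample,
# then classify every point through its BMU.
d_unit = ((W[:, None, :] - X[labeled_idx][None, :, :]) ** 2).sum(axis=2)
unit_labels = labels[np.argmin(d_unit, axis=1)]
pred = unit_labels[np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2),
                             axis=1)]
```

The key point matching the summary: the SOM is trained entirely without labels, and the handful of labeled points only annotate the learned units afterward.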
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - MISNN: Multiple Imputation via Semi-parametric Neural Networks [9.594714330925703]
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research.
We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI.
arXiv Detail & Related papers (2023-05-02T21:45:36Z) - Transformed Distribution Matching for Missing Value Imputation [7.754689608872696]
Key to missing value imputation is to capture the data distribution with incomplete samples and impute the missing values accordingly.
In this paper, we propose to impute the missing values of two batches of data by transforming them into a latent space through deep invertible functions.
To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed.
arXiv Detail & Related papers (2023-02-20T23:44:30Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning over structured spaces with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Minimax rate of consistency for linear models with missing values [0.0]
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...).
In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task.
This eventually requires solving a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets.
arXiv Detail & Related papers (2022-02-03T08:45:34Z) - CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems, showing that CvS achieves much higher classification accuracy than previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
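A common way to write such an objective, assumed here for illustration (the paper's exact notation and constraints may differ), is

```latex
\min_{W} \;\; \|X - X W W^{\top}\|_F^2
  \;+\; \lambda \sum_{i=1}^{d} \|w^i\|_2^p ,
\qquad 0 < p \le 1,
```

where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $W \in \mathbb{R}^{d \times k}$ the projection matrix with rows $w^i$, and the row-wise $l_{2,p}$ penalty drives entire rows of $W$ to zero, so the surviving rows index the selected features.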
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - IFGAN: Missing Value Imputation using Feature-specific Generative
Adversarial Networks [14.714106979097222]
We propose IFGAN, a missing value imputation algorithm based on Feature-specific Generative Adversarial Networks (GANs)
A feature-specific generator is trained to impute missing values, while a discriminator is expected to distinguish the imputed values from observed ones.
We empirically show on several real-life datasets that IFGAN outperforms current state-of-the-art algorithms under various missingness conditions.
arXiv Detail & Related papers (2020-12-23T10:14:35Z) - Establishing strong imputation performance of a denoising autoencoder in
a wide range of missing data problems [0.0]
We develop a consistent framework for both training and imputation.
We benchmarked the results against state-of-the-art imputation methods.
The developed autoencoder obtained the smallest error for all ranges of initial data corruption.
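The train-and-impute framework can be sketched as follows. This is a hedged illustration that uses scikit-learn's MLPRegressor as the denoiser on synthetic data; the paper's architecture, corruption scheme, and training framework are not reproduced.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Complete training data; the DAE learns to reconstruct it from corrupted copies.
X_train = rng.random((200, 6))
X_noisy = X_train.copy()
drop = rng.random(X_train.shape) < 0.2
X_noisy[drop] = 0.0                       # corruption: zero-mask 20% of entries

dae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
dae.fit(X_noisy, X_train)                 # corrupted in, clean out

# Imputation: zero-fill the missing entries, feed the vector through the
# denoiser, and keep its reconstruction only at the missing positions.
x = rng.random(6)
miss = np.array([True, False, False, True, False, False])
x_in = np.where(miss, 0.0, x)
x_hat = dae.predict(x_in.reshape(1, -1))[0]
x_imputed = np.where(miss, x_hat, x)
```

Note the dependence on complete training rows here, which is exactly the limitation the main abstract raises for autoencoder-based imputation.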
arXiv Detail & Related papers (2020-04-06T12:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.