Chains of Autoreplicative Random Forests for missing value imputation in
high-dimensional datasets
- URL: http://arxiv.org/abs/2301.00595v1
- Date: Mon, 2 Jan 2023 10:53:52 GMT
- Title: Chains of Autoreplicative Random Forests for missing value imputation in
high-dimensional datasets
- Authors: Ekaterina Antonenko and Jesse Read
- Abstract summary: Missing values are a common problem in data science and machine learning.
We consider missing value imputation as a multi-label classification problem and propose Chains of Autoreplicative Random Forests.
Our algorithm effectively imputes missing values based only on information from the dataset.
- Score: 1.5076964620370268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing values are a common problem in data science and machine learning.
Removing instances with missing values can adversely affect the quality of
further data analysis. This is exacerbated when there are relatively many more
features than instances, and thus the proportion of affected instances is high.
Such a scenario is common in many important domains, for example, single
nucleotide polymorphism (SNP) datasets provide a large number of features over
a genome for a relatively small number of individuals. To preserve as much
information as possible prior to modeling, a rigorous imputation scheme is
acutely needed. While Denoising Autoencoders are a state-of-the-art method for
imputation in high-dimensional data, they still require enough complete cases
for training, which are often not available in real-world problems. In this
paper, we consider missing value imputation as a multi-label classification
problem and propose Chains of Autoreplicative Random Forests. Using multi-label
Random Forests instead of neural networks works well for low-sampled data as
there are fewer parameters to optimize. Experiments on several SNP datasets
show that our algorithm effectively imputes missing values based only on
information from the dataset and exhibits better performance than standard
algorithms that do not require any additional information. In this paper, the
algorithm is implemented specifically for SNP data, but it can easily be
adapted for other cases of missing value imputation.
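The chaining idea described in the abstract can be sketched as follows. This is a simplified, single-chain illustration using scikit-learn's RandomForestClassifier on a toy SNP-like matrix; the paper's exact chain ordering, ensembling of chains, and multi-label forest structure are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy SNP-like matrix: genotypes coded 0/1/2, with ~10% entries missing (NaN).
X = rng.integers(0, 3, size=(60, 8)).astype(float)
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

def chain_impute(X, n_passes=2, seed=0):
    """One left-to-right chain: each feature is re-predicted ('autoreplicated')
    from all the others, and imputed values feed forward into later models."""
    X = X.copy()
    miss = np.isnan(X)
    # Initial fill with the per-column mode so every model sees complete inputs.
    for j in range(X.shape[1]):
        vals, counts = np.unique(X[~miss[:, j], j], return_counts=True)
        X[miss[:, j], j] = vals[np.argmax(counts)]
    for _ in range(n_passes):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)   # all features except the target
            clf = RandomForestClassifier(n_estimators=50, random_state=seed)
            clf.fit(others[~miss[:, j]], X[~miss[:, j], j])
            X[miss[:, j], j] = clf.predict(others[miss[:, j]])
    return X

X_imputed = chain_impute(X)
```

Because each forest is trained only on the dataset itself, no complete reference panel is needed, which is the property the abstract emphasizes over autoencoder-based imputation.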
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable via our training procedure, including gradient descent and regularizers, which limits this flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMUs)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
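The BMU-based label propagation summarized above can be sketched as follows. This is a minimal from-scratch 1-D SOM on toy data, written for illustration only; the paper's topological-projection details, grid topology, and unit-selection strategy are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two unlabeled 2-D blobs; only one labeled sample per class is available.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
labeled_idx = np.array([0, 50])
labels = np.array([0, 1])

# Train a tiny 1-D SOM (4 units) on the unlabeled data.
n_units = 4
W = rng.normal(1.5, 1.0, (n_units, 2))
for t in range(300):
    x = X[rng.integers(len(X))]
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))            # best matching unit
    lr = 0.5 * (1.0 - t / 300)                             # decaying learning rate
    h = np.exp(-((np.arange(n_units) - bmu) ** 2) / 2.0)   # neighbourhood kernel
    W += lr * h[:, None] * (x - W)

# Label each unit with the class of its nearest labeled sample,
# then classify every point through its BMU.
d_unit = ((W[:, None, :] - X[labeled_idx][None, :, :]) ** 2).sum(axis=2)
unit_labels = labels[np.argmin(d_unit, axis=1)]
pred = unit_labels[np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2),
                             axis=1)]
```

The key point matching the summary: the SOM is trained entirely without labels, and the handful of labeled points only annotate the learned units afterward.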
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - MISNN: Multiple Imputation via Semi-parametric Neural Networks [9.594714330925703]
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research.
We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI.
arXiv Detail & Related papers (2023-05-02T21:45:36Z) - Transformed Distribution Matching for Missing Value Imputation [7.754689608872696]
Key to missing value imputation is to capture the data distribution with incomplete samples and impute the missing values accordingly.
In this paper, we propose to impute the missing values of two batches of data by transforming them into a latent space through deep invertible functions.
To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed.
arXiv Detail & Related papers (2023-02-20T23:44:30Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning over structured spaces with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Minimax rate of consistency for linear models with missing values [0.0]
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...).
In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task.
This eventually requires solving a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets.
arXiv Detail & Related papers (2022-02-03T08:45:34Z) - CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems, showing that CvS achieves much higher classification accuracy than previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
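A common way to write such an objective, assumed here for illustration (the paper's exact notation and constraints may differ), is

```latex
\min_{W} \;\; \|X - X W W^{\top}\|_F^2
  \;+\; \lambda \sum_{i=1}^{d} \|w^i\|_2^p ,
\qquad 0 < p \le 1,
```

where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $W \in \mathbb{R}^{d \times k}$ the projection matrix with rows $w^i$, and the row-wise $l_{2,p}$ penalty drives entire rows of $W$ to zero, so the surviving rows index the selected features.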
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - IFGAN: Missing Value Imputation using Feature-specific Generative
Adversarial Networks [14.714106979097222]
We propose IFGAN, a missing value imputation algorithm based on Feature-specific Generative Adversarial Networks (GANs)
A feature-specific generator is trained to impute missing values, while a discriminator is expected to distinguish the imputed values from observed ones.
We empirically show on several real-life datasets that IFGAN outperforms current state-of-the-art algorithms under various missingness conditions.
arXiv Detail & Related papers (2020-12-23T10:14:35Z) - Establishing strong imputation performance of a denoising autoencoder in
a wide range of missing data problems [0.0]
We develop a consistent framework for both training and imputation.
We benchmarked the results against state-of-the-art imputation methods.
The developed autoencoder obtained the smallest error for all ranges of initial data corruption.
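The train-and-impute framework can be sketched as follows. This is a hedged illustration that uses scikit-learn's MLPRegressor as the denoiser on synthetic data; the paper's architecture, corruption scheme, and training framework are not reproduced.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Complete training data; the DAE learns to reconstruct it from corrupted copies.
X_train = rng.random((200, 6))
X_noisy = X_train.copy()
drop = rng.random(X_train.shape) < 0.2
X_noisy[drop] = 0.0                       # corruption: zero-mask 20% of entries

dae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
dae.fit(X_noisy, X_train)                 # corrupted in, clean out

# Imputation: zero-fill the missing entries, feed the vector through the
# denoiser, and keep its reconstruction only at the missing positions.
x = rng.random(6)
miss = np.array([True, False, False, True, False, False])
x_in = np.where(miss, 0.0, x)
x_hat = dae.predict(x_in.reshape(1, -1))[0]
x_imputed = np.where(miss, x_hat, x)
```

Note the dependence on complete training rows here, which is exactly the limitation the main abstract raises for autoencoder-based imputation.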
arXiv Detail & Related papers (2020-04-06T12:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.