Numerical Data Imputation for Multimodal Data Sets: A Probabilistic
Nearest-Neighbor Kernel Density Approach
- URL: http://arxiv.org/abs/2306.16906v2
- Date: Mon, 10 Jul 2023 05:35:10 GMT
- Title: Numerical Data Imputation for Multimodal Data Sets: A Probabilistic
Nearest-Neighbor Kernel Density Approach
- Authors: Florian Lalande and Kenji Doya
- Abstract summary: We introduce a data imputation method combining nearest neighbor estimation ($k$NN) and density estimation with Gaussian kernels (KDE)
We show that our method can cope with complex original data structure, yields lower data imputation errors, and provides probabilistic estimates with higher likelihood than current methods.
- Score: 2.750124853532831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerical data imputation algorithms replace missing values by estimates to
leverage incomplete data sets. Current imputation methods seek to minimize the
error between the unobserved ground truth and the imputed values. But this
strategy can create artifacts leading to poor imputation in the presence of
multimodal or complex distributions. To tackle this problem, we introduce the
$k$NN$\times$KDE algorithm: a data imputation method combining nearest neighbor
estimation ($k$NN) and density estimation with Gaussian kernels (KDE). We
compare our method with previous data imputation methods using artificial and
real-world data with different data missing scenarios and various data missing
rates, and show that our method can cope with complex original data structure,
yields lower data imputation errors, and provides probabilistic estimates with
higher likelihood than current methods. We release the code in open-source for
the community: https://github.com/DeltaFloflo/knnxkde
Related papers
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
conditional distributions is a central problem in machine learning.
We propose a new learning paradigm that integrates both paired and unpaired data.
Our approach also connects intriguingly with inverse entropic optimal transport (OT)
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to adaptively defining the neighborhood size.
Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
arXiv Detail & Related papers (2024-09-08T13:08:45Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - RIFLE: Imputation and Robust Inference from Low Order Marginals [10.082738539201804]
We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
arXiv Detail & Related papers (2021-09-01T23:17:30Z) - DPER: Efficient Parameter Estimation for Randomly Missing Data [0.24466725954625884]
We propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set.
Our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods.
arXiv Detail & Related papers (2021-06-06T16:37:48Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Online Missing Value Imputation and Change Point Detection with the
Gaussian Copula [21.26330349034669]
Missing value imputation is crucial for real-world data science.
We develop an online imputation algorithm for mixed data using the Gaussian copula.
arXiv Detail & Related papers (2020-09-25T16:27:47Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Matrix Completion with Quantified Uncertainty through Low Rank Gaussian
Copula [30.84155327760468]
This paper proposes a framework for missing value imputation with quantified uncertainty.
The time required to fit the model scales linearly with the number of rows and the number of columns in the dataset.
Empirical results show the method yields state-of-the-art imputation accuracy across a wide range of data types.
arXiv Detail & Related papers (2020-06-18T19:51:42Z) - Establishing strong imputation performance of a denoising autoencoder in
a wide range of missing data problems [0.0]
We develop a consistent framework for both training and imputation.
We benchmarked the results against state-of-the-art imputation methods.
The developed autoencoder obtained the smallest error for all ranges of initial data corruption.
arXiv Detail & Related papers (2020-04-06T12:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.