Related papers: Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

URL: http://arxiv.org/abs/2306.16906v2
Date: Mon, 10 Jul 2023 05:35:10 GMT
Title: Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach
Authors: Florian Lalande and Kenji Doya
Abstract summary: We introduce a data imputation method combining nearest neighbor estimation ($k$NN) and density estimation with Gaussian kernels (KDE) We show that our method can cope with complex original data structure, yields lower data imputation errors, and provides probabilistic estimates with higher likelihood than current methods.
Score: 2.750124853532831
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Numerical data imputation algorithms replace missing values by estimates to leverage incomplete data sets. Current imputation methods seek to minimize the error between the unobserved ground truth and the imputed values. But this strategy can create artifacts leading to poor imputation in the presence of multimodal or complex distributions. To tackle this problem, we introduce the $k$NN$\times$KDE algorithm: a data imputation method combining nearest neighbor estimation ($k$NN) and density estimation with Gaussian kernels (KDE). We compare our method with previous data imputation methods using artificial and real-world data with different data missing scenarios and various data missing rates, and show that our method can cope with complex original data structure, yields lower data imputation errors, and provides probabilistic estimates with higher likelihood than current methods. We release the code in open-source for the community: https://github.com/DeltaFloflo/knnxkde

Related papers

Kernel Representation and Similarity Measure for Incomplete Data [55.62595187178638]
Measuring similarity between incomplete data is a fundamental challenge in web mining, recommendation systems, and user behavior analysis.<n>Traditional approaches either discard incomplete data or perform imputation as a preprocessing step, leading to information loss and biased similarity estimates.<n>This paper presents a new similarity measure that directly computes similarity between incomplete data in kernel feature space without explicit imputation in the original space.
arXiv Detail & Related papers (2025-10-15T09:41:23Z)
Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
conditional distributions is a central problem in machine learning. We propose a new learning paradigm that integrates both paired and unpaired data. Our approach also connects intriguingly with inverse entropic optimal transport (OT)
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to adaptively defining the neighborhood size. Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
arXiv Detail & Related papers (2024-09-08T13:08:45Z)
Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm. The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources. It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
RIFLE: Imputation and Robust Inference from Low Order Marginals [10.082738539201804]
We develop a statistical inference framework for regression and classification in the presence of missing data without imputation. Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
arXiv Detail & Related papers (2021-09-01T23:17:30Z)
DPER: Efficient Parameter Estimation for Randomly Missing Data [0.24466725954625884]
We propose novel algorithms to find the maximum likelihood estimates (MLEs) for a one-class/multiple-class randomly missing data set. Our algorithms do not require multiple iterations through the data, thus promising to be less time-consuming than other methods.
arXiv Detail & Related papers (2021-06-06T16:37:48Z)
List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions. Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset. A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles. Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center. We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes. A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
Online Missing Value Imputation and Change Point Detection with the Gaussian Copula [21.26330349034669]
Missing value imputation is crucial for real-world data science. We develop an online imputation algorithm for mixed data using the Gaussian copula.
arXiv Detail & Related papers (2020-09-25T16:27:47Z)
Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task. We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
Matrix Completion with Quantified Uncertainty through Low Rank Gaussian Copula [30.84155327760468]
This paper proposes a framework for missing value imputation with quantified uncertainty. The time required to fit the model scales linearly with the number of rows and the number of columns in the dataset. Empirical results show the method yields state-of-the-art imputation accuracy across a wide range of data types.
arXiv Detail & Related papers (2020-06-18T19:51:42Z)
Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems [0.0]
We develop a consistent framework for both training and imputation. We benchmarked the results against state-of-the-art imputation methods. The developed autoencoder obtained the smallest error for all ranges of initial data corruption.
arXiv Detail & Related papers (2020-04-06T12:00:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.