Missing Data Imputation using Optimal Transport
- URL: http://arxiv.org/abs/2002.03860v3
- Date: Wed, 1 Jul 2020 09:16:41 GMT
- Title: Missing Data Imputation using Optimal Transport
- Authors: Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi
- Abstract summary: Starting from the assumption that two random batches drawn from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function for imputing missing data values.
We propose practical methods to minimize these losses using end-to-end learning.
These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
- Score: 43.14084843713895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing data is a crucial issue when applying machine learning algorithms to
real-world datasets. Starting from the simple assumption that two batches
extracted randomly from the same dataset should share the same distribution, we
leverage optimal transport distances to quantify that criterion and turn it
into a loss function to impute missing data values. We propose practical
methods to minimize these losses using end-to-end learning, which may or may
not exploit parametric assumptions on the underlying distributions of values. We
evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR
settings. These experiments show that OT-based methods match or outperform
state-of-the-art imputation methods, even for high percentages of missing
values.
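
To make the batch-comparison idea concrete, here is a minimal PyTorch sketch, not the authors' released implementation: the missing entries are treated as trainable parameters, and a plain entropic Sinkhorn cost between two random batches of the completed data is minimized end to end. The paper itself uses debiased Sinkhorn divergences and further refinements; every name below is illustrative.

```python
import math
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iter=200):
    # Entropic optimal-transport cost between two point clouds with
    # uniform weights, via log-domain Sinkhorn iterations (no debiasing).
    n, m = x.shape[0], y.shape[0]
    cost = torch.cdist(x, y) ** 2                      # squared Euclidean costs
    log_mu = torch.full((n,), -math.log(n))            # log of uniform weights
    log_nu = torch.full((m,), -math.log(m))
    f, g = torch.zeros(n), torch.zeros(m)              # dual potentials
    for _ in range(n_iter):
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_mu[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_mu[:, None] + log_nu[None, :])
    return (plan * cost).sum()

def ot_impute(data, mask, batch_size=128, n_steps=1000, lr=1e-2):
    # data: (n, d) float tensor (missing entries may hold any placeholder).
    # mask: (n, d) bool tensor, True where a value is missing.
    # Missing entries become free parameters, optimised so that two random
    # batches of the completed data are close in entropic OT cost.
    imps = (0.1 * torch.randn(int(mask.sum()))).requires_grad_()
    opt = torch.optim.Adam([imps], lr=lr)
    for _ in range(n_steps):
        filled = data.clone()
        filled[mask] = imps                            # differentiable fill-in
        idx = torch.randperm(data.shape[0])
        b1 = filled[idx[:batch_size]]
        b2 = filled[idx[batch_size:2 * batch_size]]
        loss = sinkhorn_cost(b1, b2)
        opt.zero_grad()
        loss.backward()                                # backprop through Sinkhorn
        opt.step()
    out = data.clone()
    out[mask] = imps.detach()
    return out
```

Gradients flow through the unrolled Sinkhorn iterations into the imputed entries, which is what makes this loss trainable end to end.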
Related papers
- Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches [11.048092826888412]
This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for handling missing data within the growth curve modeling framework.
We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation.
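
As a toy illustration of this kind of Monte Carlo design (the study's six techniques and its growth curve models are considerably more involved), the sketch below simulates linear growth data, applies MCAR missingness, and compares slope estimates from available-case analysis and column-mean imputation across replications; all settings are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_growth(n=200, t=4, slope=0.5):
    # Linear growth curves: y[i, j] = intercept_i + slope * time_j + noise.
    time = np.arange(t)
    intercepts = rng.normal(0.0, 1.0, size=(n, 1))
    return intercepts + slope * time + rng.normal(0.0, 0.5, size=(n, t)), time

def estimate_slope(y, time):
    # Pooled OLS slope of outcome on time over the observed entries.
    obs = ~np.isnan(y)
    return np.polyfit(np.broadcast_to(time, y.shape)[obs], y[obs], 1)[0]

slopes_ac, slopes_mi = [], []
for _ in range(500):                                 # Monte Carlo replications
    y, time = simulate_growth()
    y_miss = y.copy()
    y_miss[rng.random(y.shape) < 0.3] = np.nan       # 30% MCAR missingness
    slopes_ac.append(estimate_slope(y_miss, time))   # available-case analysis
    y_imp = np.where(np.isnan(y_miss), np.nanmean(y_miss, axis=0), y_miss)
    slopes_mi.append(estimate_slope(y_imp, time))    # column-mean imputation

print("true slope 0.5 | available-case:", np.mean(slopes_ac),
      "| mean-imputed:", np.mean(slopes_mi))
```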
arXiv Detail & Related papers (2024-06-19T20:20:30Z)
- Transformed Distribution Matching for Missing Value Imputation [7.754689608872696]
The key to missing value imputation is capturing the data distribution from incomplete samples and imputing the missing values accordingly.
In this paper, we propose to impute the missing values of two batches of data by transforming them into a latent space through deep invertible functions.
To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed.
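
A rough sketch of that idea under strong simplifications: an elementwise affine map stands in for the paper's deep invertible functions, and a squared RBF-kernel MMD stands in for its matching loss, so this is a conceptual illustration rather than the proposed algorithm. The real method must also keep the map from collapsing to a constant, which this toy objective alone would not prevent.

```python
import torch

def mmd2(x, y, sigma=1.0):
    # Squared RBF-kernel MMD between two batches; a simple stand-in
    # for the paper's distribution-matching loss.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

class AffineMap(torch.nn.Module):
    # Trivially invertible elementwise affine map; a placeholder for the
    # deep invertible functions used in the paper.
    def __init__(self, d):
        super().__init__()
        self.log_scale = torch.nn.Parameter(torch.zeros(d))
        self.shift = torch.nn.Parameter(torch.zeros(d))

    def forward(self, x):
        return x * torch.exp(self.log_scale) + self.shift

def tdm_style_impute(data, mask, steps=1000, batch=64, lr=1e-2):
    # Jointly learn the map and the missing entries so that two random
    # batches of the completed data match in the transformed space.
    imps = (0.1 * torch.randn(int(mask.sum()))).requires_grad_()
    net = AffineMap(data.shape[1])
    opt = torch.optim.Adam([imps, *net.parameters()], lr=lr)
    for _ in range(steps):
        filled = data.clone()
        filled[mask] = imps
        idx = torch.randperm(data.shape[0])
        loss = mmd2(net(filled[idx[:batch]]), net(filled[idx[batch:2 * batch]]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    out = data.clone()
    out[mask] = imps.detach()
    return out
```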
arXiv Detail & Related papers (2023-02-20T23:44:30Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
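
Taking that description literally, ATC admits a short numpy sketch: pick the threshold on held-out source confidences so that the fraction above it matches source accuracy, then score the target set. The paper also considers alternative confidence scores such as negative entropy; the data below are synthetic.

```python
import numpy as np

def atc_predict_accuracy(src_conf, src_correct, tgt_conf):
    # Choose threshold t so that the share of source confidences above t
    # equals the source accuracy, then predict target accuracy as the
    # share of target confidences above t.
    t = np.quantile(src_conf, 1.0 - src_correct.mean())
    return (tgt_conf > t).mean()

# Synthetic usage: a roughly calibrated source and a shifted target.
rng = np.random.default_rng(0)
src_conf = rng.uniform(0.5, 1.0, 1000)
src_correct = rng.uniform(0, 1, 1000) < src_conf
tgt_conf = rng.uniform(0.4, 0.95, 1000)
print("predicted target accuracy:", atc_predict_accuracy(src_conf, src_correct, tgt_conf))
```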
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train inference models on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- FCMI: Feature Correlation based Missing Data Imputation [0.0]
We propose FCMI, an efficient correlation-based technique for imputing missing values in a dataset.
Our proposed algorithm picks the highly correlated attributes of the dataset and uses these attributes to build a regression model.
Experiments conducted on both classification and regression datasets show that the proposed imputation technique outperforms existing imputation algorithms.
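
Based only on that description, a hedged pandas/scikit-learn sketch could look as follows; the selection rule, the choice of k, and the use of linear regression are guesses rather than the paper's actual procedure.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fcmi_style_impute(df: pd.DataFrame, target_col: str, k: int = 3) -> pd.DataFrame:
    # Pick the k attributes most correlated (in absolute value) with the
    # column to impute, fit a regression on rows where it is observed,
    # and predict the rows where it is missing.
    corr = df.corr()[target_col].drop(target_col).abs()
    feats = corr.nlargest(k).index.tolist()
    usable = df[feats].notna().all(axis=1)
    known = df[df[target_col].notna() & usable]
    todo = df[df[target_col].isna() & usable]
    model = LinearRegression().fit(known[feats], known[target_col])
    out = df.copy()
    out.loc[todo.index, target_col] = model.predict(todo[feats])
    return out
```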
arXiv Detail & Related papers (2021-06-26T13:35:33Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
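
That weighting scheme suggests a compact implementation: weight each per-sample loss by a softmax of the (detached) losses themselves, so high-loss samples dominate the batch gradient. The sketch below is an interpretation of the abstract rather than the authors' code; `lam` is a temperature, and `opt` can be momentum SGD as in the paper.

```python
import torch
import torch.nn.functional as F

def absgd_step(model, opt, xb, yb, lam=1.0):
    # One mini-batch step with attention-like per-sample weights:
    # w_i = exp(loss_i / lam) / sum_j exp(loss_j / lam).
    losses = F.cross_entropy(model(xb), yb, reduction="none")
    weights = torch.softmax(losses.detach() / lam, dim=0)  # stop-gradient weights
    loss = (weights * losses).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```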
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
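
One instance of this view, ε-sample complexity, admits a compact sketch: report the smallest number of labelled examples at which a simple probe trained on the representation reaches a validation loss below ε. The probe, the loss target, and the size grid below are illustrative choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def eps_sample_complexity(z, y, z_val, y_val, eps=0.4,
                          sizes=(32, 64, 128, 256, 512, 1024)):
    # Smaller return values mean the representation makes the task easier.
    # Assumes every class appears in each training subset.
    for n in sizes:
        probe = LogisticRegression(max_iter=1000).fit(z[:n], y[:n])
        if log_loss(y_val, probe.predict_proba(z_val)) < eps:
            return n
    return None   # the loss target was never reached on this grid
```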
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision [5.352699766206807]
Active learning (AL) aims to minimize labeling efforts for data-demanding deep neural networks (DNNs).
We propose a low-complexity method for feature density matching using a self-supervised Fisher kernel (FK).
Our method outperforms state-of-the-art methods on MNIST, SVHN, and ImageNet classification while requiring only a tenth of the processing.
arXiv Detail & Related papers (2020-03-01T03:56:32Z)
- On the consistency of supervised learning with missing values [15.666860186278782]
In many application settings, the data have missing entries which make analysis challenging.
Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data.
We show that the widely-used approach of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative.
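
In scikit-learn terms, the analysed procedure is simply constant imputation fitted on the training data and reused at test time, followed by a (preferably flexible) learner; the pipeline below is an illustration on synthetic data, not the paper's experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# One pipeline ensures the training-set means are reused at prediction time.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestRegressor(n_estimators=100, random_state=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=500)
X[rng.random(X.shape) < 0.2] = np.nan          # 20% MCAR entries
model.fit(X, y)
print("in-sample R^2:", model.score(X, y))
```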
arXiv Detail & Related papers (2019-02-19T07:27:19Z)