Missing Data Imputation using Optimal Transport
- URL: http://arxiv.org/abs/2002.03860v3
- Date: Wed, 1 Jul 2020 09:16:41 GMT
- Title: Missing Data Imputation using Optimal Transport
- Authors: Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi
- Abstract summary: Starting from the assumption that two random batches drawn from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function for imputing missing data values.
We propose practical methods to minimize these losses using end-to-end learning.
These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
- Score: 43.14084843713895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing data is a crucial issue when applying machine learning algorithms to
real-world datasets. Starting from the simple assumption that two batches
extracted randomly from the same dataset should share the same distribution, we
leverage optimal transport distances to quantify that criterion and turn it
into a loss function to impute missing data values. We propose practical
methods to minimize these losses using end-to-end learning, which may or may
not exploit parametric assumptions on the underlying distributions of values. We
evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR
settings. These experiments show that OT-based methods match or outperform
state-of-the-art imputation methods, even for high percentages of missing
values.
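
To make the batch-comparison idea concrete, here is a minimal PyTorch sketch, not the authors' released implementation: the missing entries are treated as trainable parameters, and a plain entropic Sinkhorn cost between two random batches of the completed data is minimized end to end. The paper itself uses debiased Sinkhorn divergences and further refinements; every name below is illustrative.

```python
import math
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iter=200):
    # Entropic optimal-transport cost between two point clouds with
    # uniform weights, via log-domain Sinkhorn iterations (no debiasing).
    n, m = x.shape[0], y.shape[0]
    cost = torch.cdist(x, y) ** 2                      # squared Euclidean costs
    log_mu = torch.full((n,), -math.log(n))            # log of uniform weights
    log_nu = torch.full((m,), -math.log(m))
    f, g = torch.zeros(n), torch.zeros(m)              # dual potentials
    for _ in range(n_iter):
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_mu[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps
                     + log_mu[:, None] + log_nu[None, :])
    return (plan * cost).sum()

def ot_impute(data, mask, batch_size=128, n_steps=1000, lr=1e-2):
    # data: (n, d) float tensor (missing entries may hold any placeholder).
    # mask: (n, d) bool tensor, True where a value is missing.
    # Missing entries become free parameters, optimised so that two random
    # batches of the completed data are close in entropic OT cost.
    imps = (0.1 * torch.randn(int(mask.sum()))).requires_grad_()
    opt = torch.optim.Adam([imps], lr=lr)
    for _ in range(n_steps):
        filled = data.clone()
        filled[mask] = imps                            # differentiable fill-in
        idx = torch.randperm(data.shape[0])
        b1 = filled[idx[:batch_size]]
        b2 = filled[idx[batch_size:2 * batch_size]]
        loss = sinkhorn_cost(b1, b2)
        opt.zero_grad()
        loss.backward()                                # backprop through Sinkhorn
        opt.step()
    out = data.clone()
    out[mask] = imps.detach()
    return out
```

Gradients flow through the unrolled Sinkhorn iterations into the imputed entries, which is what makes this loss trainable end to end.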
Related papers
- Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches [11.048092826888412]
This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for handling missing data within the growth curve modeling framework.
We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation.
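
As a toy illustration of this kind of Monte Carlo design (the study's six techniques and its growth curve models are considerably more involved), the sketch below simulates linear growth data, applies MCAR missingness, and compares slope estimates from available-case analysis and column-mean imputation across replications; all settings are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_growth(n=200, t=4, slope=0.5):
    # Linear growth curves: y[i, j] = intercept_i + slope * time_j + noise.
    time = np.arange(t)
    intercepts = rng.normal(0.0, 1.0, size=(n, 1))
    return intercepts + slope * time + rng.normal(0.0, 0.5, size=(n, t)), time

def estimate_slope(y, time):
    # Pooled OLS slope of outcome on time over the observed entries.
    obs = ~np.isnan(y)
    return np.polyfit(np.broadcast_to(time, y.shape)[obs], y[obs], 1)[0]

slopes_ac, slopes_mi = [], []
for _ in range(500):                                 # Monte Carlo replications
    y, time = simulate_growth()
    y_miss = y.copy()
    y_miss[rng.random(y.shape) < 0.3] = np.nan       # 30% MCAR missingness
    slopes_ac.append(estimate_slope(y_miss, time))   # available-case analysis
    y_imp = np.where(np.isnan(y_miss), np.nanmean(y_miss, axis=0), y_miss)
    slopes_mi.append(estimate_slope(y_imp, time))    # column-mean imputation

print("true slope 0.5 | available-case:", np.mean(slopes_ac),
      "| mean-imputed:", np.mean(slopes_mi))
```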
arXiv Detail & Related papers (2024-06-19T20:20:30Z)
- Transformed Distribution Matching for Missing Value Imputation [7.754689608872696]
The key to missing value imputation is capturing the data distribution from incomplete samples and imputing the missing values accordingly.
In this paper, we propose to impute the missing values of two batches of data by transforming them into a latent space through deep invertible functions.
To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed.
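
A rough sketch of that idea under strong simplifications: an elementwise affine map stands in for the paper's deep invertible functions, and a squared RBF-kernel MMD stands in for its matching loss, so this is a conceptual illustration rather than the proposed algorithm. The real method must also keep the map from collapsing to a constant, which this toy objective alone would not prevent.

```python
import torch

def mmd2(x, y, sigma=1.0):
    # Squared RBF-kernel MMD between two batches; a simple stand-in
    # for the paper's distribution-matching loss.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

class AffineMap(torch.nn.Module):
    # Trivially invertible elementwise affine map; a placeholder for the
    # deep invertible functions used in the paper.
    def __init__(self, d):
        super().__init__()
        self.log_scale = torch.nn.Parameter(torch.zeros(d))
        self.shift = torch.nn.Parameter(torch.zeros(d))

    def forward(self, x):
        return x * torch.exp(self.log_scale) + self.shift

def tdm_style_impute(data, mask, steps=1000, batch=64, lr=1e-2):
    # Jointly learn the map and the missing entries so that two random
    # batches of the completed data match in the transformed space.
    imps = (0.1 * torch.randn(int(mask.sum()))).requires_grad_()
    net = AffineMap(data.shape[1])
    opt = torch.optim.Adam([imps, *net.parameters()], lr=lr)
    for _ in range(steps):
        filled = data.clone()
        filled[mask] = imps
        idx = torch.randperm(data.shape[0])
        loss = mmd2(net(filled[idx[:batch]]), net(filled[idx[batch:2 * batch]]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    out = data.clone()
    out[mask] = imps.detach()
    return out
```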
arXiv Detail & Related papers (2023-02-20T23:44:30Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
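
Taking that description literally, ATC admits a short numpy sketch: pick the threshold on held-out source confidences so that the fraction above it matches source accuracy, then score the target set. The paper also considers alternative confidence scores such as negative entropy; the data below are synthetic.

```python
import numpy as np

def atc_predict_accuracy(src_conf, src_correct, tgt_conf):
    # Choose threshold t so that the share of source confidences above t
    # equals the source accuracy, then predict target accuracy as the
    # share of target confidences above t.
    t = np.quantile(src_conf, 1.0 - src_correct.mean())
    return (tgt_conf > t).mean()

# Synthetic usage: a roughly calibrated source and a shifted target.
rng = np.random.default_rng(0)
src_conf = rng.uniform(0.5, 1.0, 1000)
src_correct = rng.uniform(0, 1, 1000) < src_conf
tgt_conf = rng.uniform(0.4, 0.95, 1000)
print("predicted target accuracy:", atc_predict_accuracy(src_conf, src_correct, tgt_conf))
```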
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train inference models on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- FCMI: Feature Correlation based Missing Data Imputation [0.0]
We propose FCMI, an efficient correlation-based technique for imputing missing values in a dataset.
Our proposed algorithm picks the highly correlated attributes of the dataset and uses these attributes to build a regression model.
Experiments conducted on both classification and regression datasets show that the proposed imputation technique outperforms existing imputation algorithms.
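
Based only on that description, a hedged pandas/scikit-learn sketch could look as follows; the selection rule, the choice of k, and the use of linear regression are guesses rather than the paper's actual procedure.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fcmi_style_impute(df: pd.DataFrame, target_col: str, k: int = 3) -> pd.DataFrame:
    # Pick the k attributes most correlated (in absolute value) with the
    # column to impute, fit a regression on rows where it is observed,
    # and predict the rows where it is missing.
    corr = df.corr()[target_col].drop(target_col).abs()
    feats = corr.nlargest(k).index.tolist()
    usable = df[feats].notna().all(axis=1)
    known = df[df[target_col].notna() & usable]
    todo = df[df[target_col].isna() & usable]
    model = LinearRegression().fit(known[feats], known[target_col])
    out = df.copy()
    out.loc[todo.index, target_col] = model.predict(todo[feats])
    return out
```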
arXiv Detail & Related papers (2021-06-26T13:35:33Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
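
That weighting scheme suggests a compact implementation: weight each per-sample loss by a softmax of the (detached) losses themselves, so high-loss samples dominate the batch gradient. The sketch below is an interpretation of the abstract rather than the authors' code; `lam` is a temperature, and `opt` can be momentum SGD as in the paper.

```python
import torch
import torch.nn.functional as F

def absgd_step(model, opt, xb, yb, lam=1.0):
    # One mini-batch step with attention-like per-sample weights:
    # w_i = exp(loss_i / lam) / sum_j exp(loss_j / lam).
    losses = F.cross_entropy(model(xb), yb, reduction="none")
    weights = torch.softmax(losses.detach() / lam, dim=0)  # stop-gradient weights
    loss = (weights * losses).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```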
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
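
One instance of this view, ε-sample complexity, admits a compact sketch: report the smallest number of labelled examples at which a simple probe trained on the representation reaches a validation loss below ε. The probe, the loss target, and the size grid below are illustrative choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def eps_sample_complexity(z, y, z_val, y_val, eps=0.4,
                          sizes=(32, 64, 128, 256, 512, 1024)):
    # Smaller return values mean the representation makes the task easier.
    # Assumes every class appears in each training subset.
    for n in sizes:
        probe = LogisticRegression(max_iter=1000).fit(z[:n], y[:n])
        if log_loss(y_val, probe.predict_proba(z_val)) < eps:
            return n
    return None   # the loss target was never reached on this grid
```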
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision [5.352699766206807]
Active learning (AL) aims to minimize labeling efforts for data-demanding deep neural networks (DNNs).
We propose a low-complexity method for feature density matching using a self-supervised Fisher kernel (FK).
Our method outperforms state-of-the-art methods on MNIST, SVHN, and ImageNet classification while requiring only a tenth of the processing.
arXiv Detail & Related papers (2020-03-01T03:56:32Z)
- On the consistency of supervised learning with missing values [15.666860186278782]
In many application settings, the data have missing entries which make analysis challenging.
Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data.
We show that the widely-used approach of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative.
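
In scikit-learn terms, the analysed procedure is simply constant imputation fitted on the training data and reused at test time, followed by a (preferably flexible) learner; the pipeline below is an illustration on synthetic data, not the paper's experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# One pipeline ensures the training-set means are reused at prediction time.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestRegressor(n_estimators=100, random_state=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=500)
X[rng.random(X.shape) < 0.2] = np.nan          # 20% MCAR entries
model.fit(X, y)
print("in-sample R^2:", model.score(X, y))
```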
arXiv Detail & Related papers (2019-02-19T07:27:19Z)