Predicting feature imputability in the absence of ground truth
- URL: http://arxiv.org/abs/2007.07052v1
- Date: Tue, 14 Jul 2020 14:24:07 GMT
- Title: Predicting feature imputability in the absence of ground truth
- Authors: Niamh McCombe, Xuemei Ding, Girijesh Prasad, David P. Finn, Stephen
Todd, Paula L. McClean, KongFatt Wong-Lin
- Abstract summary: In real-life applications it is often difficult to evaluate whether data have been imputed accurately, because ground truth is lacking.
This paper proposes an effective and simple principal component based method for determining whether individual data features can be accurately imputed.
- Score: 2.7684432804249477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data imputation is the most popular way of dealing with missing values, but in most real-life applications large amounts of data can be missing, and it is difficult or impossible to evaluate whether the data have been imputed accurately (lack of ground truth). This paper addresses these issues by proposing an
effective and simple principal component based method for determining whether
individual data features can be accurately imputed - feature imputability. In
particular, we establish a strong linear relationship between principal
component loadings and feature imputability, even in the presence of extreme
missingness and lack of ground truth. This work will have important
implications in practical data imputation strategies.
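As a rough illustration of the paper's central claim, the sketch below builds synthetic low-rank data, masks entries at random, imputes them, and compares each feature's first principal component loading against its per-feature imputation error. The one-factor data model, missingness rate, and choice of imputer are illustrative assumptions, not the authors' pipeline:

```python
# Illustrative sketch (not the authors' exact method): features with large
# first-PC loadings should be imputed more accurately than noise-dominated ones.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n, p = 500, 8

# Latent one-factor model: loading strength decays across features,
# so later features carry relatively more noise and are harder to impute.
z = rng.normal(size=(n, 1))
loadings_true = np.linspace(1.5, 0.1, p)
X = z * loadings_true + 0.5 * rng.normal(size=(n, p))

# Mask 30% of entries completely at random.
mask = rng.random(X.shape) < 0.3
X_missing = np.where(mask, np.nan, X)

# Impute, then measure per-feature error (possible here only because the
# data is synthetic; the paper's point is predicting this without ground truth).
X_imp = IterativeImputer(random_state=0).fit_transform(X_missing)
sq_err = np.where(mask, (X_imp - X) ** 2, np.nan)
per_feature_rmse = np.sqrt(np.nanmean(sq_err, axis=0))

# Principal component loadings estimated from the completed data.
pc1_loadings = np.abs(PCA(n_components=1).fit(X_imp).components_[0])

for j in range(p):
    print(f"feature {j}: |PC1 loading|={pc1_loadings[j]:.2f}, "
          f"RMSE={per_feature_rmse[j]:.2f}")
# Expected trend: larger |PC1 loading| -> lower RMSE (higher imputability),
# mirroring the linear relationship the abstract reports.
```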
Related papers
- Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation [70.36344590967519]
We show that noisy data and nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon.
We demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
arXiv Detail & Related papers (2024-06-27T09:57:31Z)
- Iterative missing value imputation based on feature importance [6.300806721275004]
We have designed an imputation method that considers feature importance.
This algorithm iteratively performs matrix completion and feature importance learning, and specifically, matrix completion is based on a filling loss that incorporates feature importance.
Results on benchmark datasets consistently show that the proposed method outperforms five existing imputation algorithms.
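A minimal sketch of the iterative scheme just summarized, under loudly labeled assumptions: feature importance comes from random-forest regressors, and the "filling loss" is approximated by importance-weighted updates (the paper's actual algorithm may differ):

```python
# Hypothetical illustration of alternating matrix completion and
# feature-importance learning; not the paper's exact formulation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_with_importance(X, n_sweeps=5):
    X = X.copy()
    miss = np.isnan(X)
    p = X.shape[1]
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # mean start
    weight = np.ones(p)  # per-feature importance, refined every sweep
    for _ in range(n_sweeps):
        total_importance = np.zeros(p)
        for j in range(p):
            obs = ~miss[:, j]
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(np.delete(X[obs], j, axis=1), X[obs, j])
            # Track how informative each predictor is, summed over targets.
            total_importance += np.insert(model.feature_importances_, j, 0.0)
            if miss[:, j].any():
                pred = model.predict(np.delete(X[miss[:, j]], j, axis=1))
                # Importance-weighted filling: move further toward the model
                # prediction for features the ensemble deems informative.
                a = weight[j] / weight.max()
                X[miss[:, j], j] = (1 - a) * X[miss[:, j], j] + a * pred
        weight = total_importance / total_importance.sum() * p
    return X

# Toy usage on correlated synthetic data with 20% missingness.
rng = np.random.default_rng(0)
z = rng.normal(size=(300, 1))
X_full = z @ rng.normal(size=(1, 4)) + 0.3 * rng.normal(size=(300, 4))
X_miss = np.where(rng.random(X_full.shape) < 0.2, np.nan, X_full)
X_hat = impute_with_importance(X_miss)
print("RMSE:", np.sqrt(np.mean((X_hat - X_full)[np.isnan(X_miss)] ** 2)))
```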
arXiv Detail & Related papers (2023-11-14T09:03:33Z)
- Conditional expectation with regularization for missing data imputation [19.254291863337347]
Missing data frequently occurs in datasets across various domains, such as medicine, sports, and finance.
We propose a new algorithm named "conditional Distribution-based Imputation of Missing Values with Regularization" (DIMV).
DIMV operates by determining the conditional distribution of a feature that has missing entries, using the information from the fully observed features as a basis.
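The summary describes conditional-expectation imputation from fully observed features with regularization; here is a hedged sketch of that general idea (the Gaussian model, single target column, and ridge term are illustrative simplifications, not necessarily DIMV's exact formulation):

```python
# Conditional Gaussian imputation: E[x_j | x_obs] = mu_j + S_jo S_oo^{-1} (x_obs - mu_o),
# with a ridge term keeping the covariance solve well conditioned.
import numpy as np

def conditional_gaussian_impute(X, j, obs_cols, alpha=0.1):
    """Impute missing entries of column j from fully observed columns obs_cols."""
    mu = np.nanmean(X, axis=0)
    rows = ~np.isnan(X[:, j])                     # rows where feature j is seen
    Xo = X[np.ix_(rows, obs_cols)] - mu[obs_cols]
    yj = X[rows, j] - mu[j]
    # Regularized covariance of the observed block, and its cross-covariance
    # with feature j.
    S_oo = Xo.T @ Xo / len(yj) + alpha * np.eye(len(obs_cols))
    s_oj = Xo.T @ yj / len(yj)
    beta = np.linalg.solve(S_oo, s_oj)
    missing = np.isnan(X[:, j])
    X = X.copy()
    X[missing, j] = mu[j] + (X[np.ix_(missing, obs_cols)] - mu[obs_cols]) @ beta
    return X

# Toy usage: feature 2 has gaps; features 0 and 1 are fully observed.
rng = np.random.default_rng(1)
cov = [[1, .8, .6], [.8, 1, .5], [.6, .5, 1]]
X = rng.multivariate_normal([0, 0, 0], cov, 200)
X[rng.random(200) < 0.3, 2] = np.nan
X_filled = conditional_gaussian_impute(X, j=2, obs_cols=[0, 1])
```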
arXiv Detail & Related papers (2023-02-02T06:59:15Z)
- LARD: Large-scale Artificial Disfluency Generation [0.0]
We propose LARD, a method for generating complex and realistic artificial disfluencies with little effort.
The proposed method can handle three of the most common types of disfluencies: repetitions, replacements and restarts.
We release a new large-scale dataset with disfluencies that can be used on four different tasks.
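A toy, purely hypothetical illustration of the three disfluency types named above (this rule-based snippet is not the LARD method itself):

```python
# Schematic generators for the three disfluency types; all names and rules
# here are illustrative assumptions.
import random

def add_repetition(tokens, i):
    # e.g. "the the cat sat on the mat"
    return tokens[:i] + [tokens[i]] + tokens[i:]

def add_replacement(tokens, i, wrong):
    # e.g. "the dog uh, cat sat on the mat"
    return tokens[:i] + [wrong, "uh,"] + tokens[i:]

def add_restart(tokens, abandoned_prefix):
    # e.g. "i think -- the cat sat on the mat"
    return abandoned_prefix + ["--"] + tokens

random.seed(0)
sent = "the cat sat on the mat".split()
print(" ".join(add_repetition(sent, random.randrange(len(sent)))))
print(" ".join(add_replacement(sent, 1, "dog")))
print(" ".join(add_restart(sent, ["i", "think"])))
```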
arXiv Detail & Related papers (2022-01-13T16:02:36Z)
- Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation [54.899751055620904]
We study the phenomenon of memorization with turn-over dropout, an efficient method to estimate influence and memorization, for data with true labels (real data) and data with random labels (random data).
Our main findings are: (i) for both real data and random data, the network optimizes easy examples (e.g., real data) and difficult examples (e.g., random data) simultaneously, with easy ones learned at a higher speed; (ii) for real data, a correctly labeled difficult example in the training dataset is more informative than an easy one.
arXiv Detail & Related papers (2021-12-16T11:34:23Z)
- Fairness in Missing Data Imputation [2.3605348648054463]
We conduct the first known study of fairness in missing data imputation.
By studying the performance of imputation methods on three commonly used datasets, we demonstrate that unfairness in missing value imputation is widespread.
Our results suggest that, in practice, a careful investigation of related factors can provide valuable insights on mitigating unfairness associated with missing data imputation.
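One simple way to probe such unfairness, sketched under assumed group labels, masking scheme, and metric (not the paper's protocol):

```python
# Compare per-group imputation error after hiding values at random; a global
# imputer that ignores group structure is systematically worse for the group
# whose distribution sits far from the pooled mean.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, size=n)            # e.g., a protected attribute
# A feature whose distribution differs across groups.
x = np.where(group == 0, rng.normal(0, 1, n), rng.normal(2, 3, n))

mask = rng.random(n) < 0.3                    # hide 30% of values
x_obs = np.where(mask, np.nan, x)

# Global mean imputation, blind to group membership.
x_imp = SimpleImputer(strategy="mean").fit_transform(x_obs.reshape(-1, 1)).ravel()

for g in (0, 1):
    sel = mask & (group == g)
    rmse = np.sqrt(np.mean((x_imp[sel] - x[sel]) ** 2))
    print(f"group {g}: imputation RMSE = {rmse:.2f}")
```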
arXiv Detail & Related papers (2021-10-22T18:29:17Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning [80.20302993614594]
We provide a statistical analysis to overcome drawbacks of Laplacian regularization.
We unveil a large body of spectral filtering methods that exhibit desirable behaviors.
We provide realistic computational guidelines in order to make our method usable with large amounts of data.
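For context, here is a minimal sketch of the classical Laplacian-regularized (harmonic) semi-supervised learner that spectral filtering methods generalize; the toy data and Gaussian similarity graph are assumptions:

```python
# Harmonic label propagation: minimize f^T L f with f fixed to the given
# labels on labeled points, giving L_uu f_u = -L_ul y_l on unlabeled points.
import numpy as np

def laplacian_ssl(X, y_labeled, labeled_idx, sigma=1.0):
    n = len(X)
    # Gaussian similarity graph and its unnormalized Laplacian L = D - W.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma**2))
    L = np.diag(W.sum(1)) - W
    u = np.setdiff1d(np.arange(n), labeled_idx)   # unlabeled indices
    f_u = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, labeled_idx)] @ y_labeled)
    f = np.empty(n)
    f[labeled_idx] = y_labeled
    f[u] = f_u
    return f

# Toy usage: two 1-D clusters, one labeled point in each.
X = np.concatenate([np.random.default_rng(0).normal(0, .3, 20),
                    np.random.default_rng(1).normal(3, .3, 20)]).reshape(-1, 1)
f = laplacian_ssl(X, y_labeled=np.array([-1.0, 1.0]), labeled_idx=np.array([0, 20]))
print(np.sign(f[:5]), np.sign(f[-5:]))  # points inherit their cluster's label
```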
arXiv Detail & Related papers (2020-09-09T14:28:54Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- Nonparametric Feature Impact and Importance [0.6123324869194193]
We give mathematical definitions of feature impact and importance, derived from partial dependence curves, that operate directly on the data.
To assess quality, we show that features ranked by these definitions are competitive with existing feature selection techniques.
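A rough sketch of deriving an impact score from partial dependence curves, in the spirit of the summary above (the paper's exact definitions differ; the flatness-based score here is an illustrative assumption):

```python
# Impact proxy: how much a feature's partial dependence curve deviates from
# flat; an irrelevant feature yields a nearly constant curve.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
y = 3 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)  # feature 2 is irrelevant

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for j in range(3):
    pdep = partial_dependence(model, X, features=[j], grid_resolution=20)
    curve = pdep["average"][0]
    impact = np.mean(np.abs(curve - curve.mean()))  # flat curve -> impact ~ 0
    print(f"feature {j}: impact = {impact:.3f}")
```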
arXiv Detail & Related papers (2020-06-08T17:07:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.