M$^3$-Impute: Mask-guided Representation Learning for Missing Value Imputation
- URL: http://arxiv.org/abs/2410.08794v1
- Date: Fri, 11 Oct 2024 13:25:32 GMT
- Title: M$^3$-Impute: Mask-guided Representation Learning for Missing Value Imputation
- Authors: Zhongyi Yu, Zhenghao Wu, Shuhan Zhong, Weifeng Su, S. -H. Gary Chan, Chul-Ho Lee, Weipeng Zhuo,
- Abstract summary: M$3$-Impute aims to explicitly leverage the missingness information and such correlations with novel masking schemes.
Experiment results show the effectiveness of M$3$-Impute by achieving 20 best and 4 second-best MAE scores on average.
- Score: 12.174699459648842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Missing values are a common problem that poses significant challenges to data analysis and machine learning. This problem necessitates the development of an effective imputation method to fill in the missing values accurately, thereby enhancing the overall quality and utility of the datasets. Existing imputation methods, however, fall short of explicitly considering the `missingness' information in the data during the embedding initialization stage and modeling the entangled feature and sample correlations during the learning process, thus leading to inferior performance. We propose M$^3$-Impute, which aims to explicitly leverage the missingness information and such correlations with novel masking schemes. M$^3$-Impute first models the data as a bipartite graph and uses a graph neural network to learn node embeddings, where the refined embedding initialization process directly incorporates the missingness information. They are then optimized through M$^3$-Impute's novel feature correlation unit (FRU) and sample correlation unit (SRU) that effectively captures feature and sample correlations for imputation. Experiment results on 25 benchmark datasets under three different missingness settings show the effectiveness of M$^3$-Impute by achieving 20 best and 4 second-best MAE scores on average.
Related papers
- SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training [12.745160748376794]
We propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness.
Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication.
Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps.
arXiv Detail & Related papers (2024-07-09T08:26:39Z) - Imputation with Inter-Series Information from Prototypes for Irregular
Sampled Time Series [44.12943317707651]
We propose PRIME, a Prototype Recurrent Imputation ModEl, which integrates both intra-series and inter-series information for imputing missing values in irregularly sampled time series.
Our framework comprises a prototype memory module for learning inter-series information, a bidirectional gated recurrent unit utilizing prototype information for imputation, and an attentive refinement module for adjusting imputations.
arXiv Detail & Related papers (2024-01-14T10:51:30Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - Debiasing Multimodal Models via Causal Information Minimization [65.23982806840182]
We study bias arising from confounders in a causal graph for multimodal data.
Robust predictive features contain diverse information that helps a model generalize to out-of-distribution data.
We use these features as confounder representations and use them via methods motivated by causal theory to remove bias from models.
arXiv Detail & Related papers (2023-11-28T16:46:14Z) - TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z) - Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL)
We first prove that a gradient of synthetic samples with respect to a SSL objective in naive bilevel optimization is textitbiased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z) - Towards Better Modeling with Missing Data: A Contrastive Learning-based
Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
arXiv Detail & Related papers (2023-09-18T13:16:24Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Towards a methodology for addressing missingness in datasets, with an
application to demographic health datasets [0.0]
We present a methodology for tackling missing data problems using a combination of synthetic dataset generation, missing data imputation and deep learning methods.
Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of $83 %$ and $80 %$ on $a) $ an unseen real dataset and $b)$ an unseen reserved synthetic test dataset.
arXiv Detail & Related papers (2022-11-05T09:02:30Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance of guided gradient descent (IGSGD) method to train inference from inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - ELMV: an Ensemble-Learning Approach for Analyzing Electrical Health
Records with Significant Missing Values [4.9810955364960385]
We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, which introduces an effective approach to construct multiple subsets of the original EHR data with a much lower missing rate.
ELMV has been evaluated on a real-world healthcare data for critical feature identification as well as a batch of simulation data with different missing rates for outcome prediction.
arXiv Detail & Related papers (2020-06-25T06:29:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.