Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets
- URL: http://arxiv.org/abs/2407.11540v1
- Date: Tue, 16 Jul 2024 09:43:47 GMT
- Title: Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets
- Authors: Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi
- Abstract summary: "Not Another Imputation Method" (NAIM) is a transformer-based model designed to handle missing values without traditional imputation techniques.
NAIM employs feature-specific embeddings and a masked self-attention mechanism that effectively learns from available data.
We extensively evaluated NAIM on 5 publicly available datasets.
- Score: 1.02138250640885
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Handling missing values in tabular datasets presents a significant challenge in training and testing artificial intelligence models, an issue usually addressed using imputation techniques. Here we introduce "Not Another Imputation Method" (NAIM), a novel transformer-based model specifically designed to address this issue without the need for traditional imputation techniques. NAIM employs feature-specific embeddings and a masked self-attention mechanism that effectively learns from available data, thus avoiding the necessity to impute missing values. Additionally, a novel regularization technique is introduced to enhance the model's generalization capability from incomplete data. We extensively evaluated NAIM on 5 publicly available tabular datasets, demonstrating its superior performance over 6 state-of-the-art machine learning models and 4 deep learning models, each paired with 3 different imputation techniques when necessary. The results highlight the efficacy of NAIM in improving predictive performance and resilience in the presence of missing data. To facilitate further research and practical application in handling missing data without traditional imputation methods, we made the code for NAIM available at https://github.com/cosbidev/NAIM.
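As a rough sketch of the mechanism the abstract describes, the snippet below combines feature-specific embeddings with a self-attention mask so that only observed features are attended to. Every name, dimension, and the mean-pooling step here is an assumption for illustration; the actual NAIM implementation is in the linked repository.

```python
# Masked self-attention over tabular features: each feature has its own
# embedding, and missing features are excluded from attention via the key
# padding mask. Illustrative sketch only, not the NAIM code.
import torch
import torch.nn as nn

class MaskedTabularAttention(nn.Module):
    def __init__(self, n_features: int, d_model: int = 32):
        super().__init__()
        # Feature-specific embeddings: one learned vector per column.
        self.feature_embed = nn.Parameter(torch.randn(n_features, d_model))
        self.value_proj = nn.Linear(1, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features), with NaN marking a missing value.
        missing = torch.isnan(x)
        tokens = self.value_proj(torch.nan_to_num(x).unsqueeze(-1)) + self.feature_embed
        # True entries in key_padding_mask are ignored by attention.
        out, _ = self.attn(tokens, tokens, tokens, key_padding_mask=missing)
        # Mean-pool over observed positions only.
        out = out.masked_fill(missing.unsqueeze(-1), 0.0)
        denom = (~missing).sum(dim=1, keepdim=True).clamp(min=1)
        return out.sum(dim=1) / denom

model = MaskedTabularAttention(n_features=6)
batch = torch.randn(4, 6)
batch[0, 2] = float("nan")      # simulate a missing entry; no imputation needed
pooled = model(batch)           # (4, 32) representation built from observed values
```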
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the effect of a small "forget set" of training data on a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in such challenging settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose a novel approach to address this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
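MiDl's training objective is not reproduced here; as a generic illustration of the problem setting only, the sketch below fuses per-modality predictions over whichever modalities happen to be present at test time. The function name and modality keys are assumptions.

```python
# Masked late fusion: average per-modality logits over the modalities that are
# actually available at test time. A schematic baseline, not the MiDl method.
import numpy as np

def fuse_available(logits_by_modality: dict) -> np.ndarray:
    available = [z for z in logits_by_modality.values() if z is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    return np.mean(available, axis=0)

# Video stream present, audio missing at test time:
preds = fuse_available({"video": np.array([2.1, -0.3, 0.7]), "audio": None})
```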
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
- In-Database Data Imputation [0.6157028677798809]
Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making.
Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates, are computationally efficient but may introduce bias and disrupt variable relationships.
Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time.
This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method.
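For orientation, this is what MICE looks like when run in ordinary Python via scikit-learn's IterativeImputer, which cycles through the columns and regresses each on the others; the paper's contribution is executing this inside the database engine, which is not reproduced here.

```python
# MICE-style chained-equations imputation with scikit-learn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_complete = imputer.fit_transform(X)  # missing cells filled by model-based draws
```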
arXiv Detail & Related papers (2024-01-07T01:57:41Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method named Projected-Gradient Unlearning (PGU).
We provide empirical evidence that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
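A minimal sketch of the projected-gradient idea, assuming the retained knowledge can be summarized by the top singular directions of retained-data activations: the forget-set gradient is projected onto the orthogonal complement of that subspace before the update. The subspace construction and dimensions are illustrative, not the paper's PGU algorithm.

```python
# Project a gradient away from the subspace occupied by retained-data
# activations, so the unlearning update interferes less with kept knowledge.
import numpy as np

def project_out(grad: np.ndarray, retained_acts: np.ndarray, k: int = 5) -> np.ndarray:
    # Top-k directions spanned by activations on the data to remember.
    U, _, _ = np.linalg.svd(retained_acts.T, full_matrices=False)
    basis = U[:, :k]                        # (dim, k) orthonormal columns
    return grad - basis @ (basis.T @ grad)  # remove in-subspace component

rng = np.random.default_rng(0)
retained = rng.normal(size=(100, 16))   # activations on retained data
g = rng.normal(size=16)                 # gradient of the forget-set loss
g_safe = project_out(g, retained)       # weights -= lr * g_safe
```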
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
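A toy sketch of the multi-task setup, with a stand-in model: each task contributes a sequence-to-sequence loss and the shared parameters take one update on the weighted sum. The task names, weights, and toy model are assumptions, not the paper's configuration.

```python
# One multi-task training step over a shared model.
import torch
import torch.nn as nn

shared = nn.Linear(8, 8)                       # stand-in for a shared seq2seq model
opt = torch.optim.Adam(shared.parameters(), lr=1e-3)
weights = {"correction": 1.0, "alignment_aux": 0.3}
batches = {t: (torch.randn(4, 8), torch.randn(4, 8)) for t in weights}

opt.zero_grad()
total = sum(weights[t] * nn.functional.mse_loss(shared(src), tgt)
            for t, (src, tgt) in batches.items())
total.backward()
opt.step()                                     # shared parameters updated once
```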
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research investigated the use of machine learning algorithms to impute missing values in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data.
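A small sketch of the ECOC idea applied to imputation: train an Error-Correcting Output Codes ensemble on the rows where a categorical column is observed, then predict it where it is missing. The base estimator and data are placeholders.

```python
# Impute a categorical column with an ECOC classifier ensemble.
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # fully observed features
y = rng.integers(0, 3, size=200).astype(float)   # categorical target column
y[rng.random(200) < 0.2] = np.nan                # knock out 20% of its values

observed = ~np.isnan(y)
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2.0, random_state=0)
ecoc.fit(X[observed], y[observed].astype(int))
y[~observed] = ecoc.predict(X[~observed])        # fill the gaps
```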
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- PROMISSING: Pruning Missing Values in Neural Networks [0.0]
We propose a simple and intuitive yet effective method for pruning missing values (PROMISSING) during learning and inference steps in neural networks.
Our experiments show that PROMISSING results in similar prediction performance compared to various imputation techniques.
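One plausible reading of pruning missing values in a single layer, sketched below: contributions of missing inputs are dropped and the remaining sum is rescaled by the fraction of inputs observed, so no imputed value ever enters the network. This is an illustration, not the authors' implementation.

```python
# Linear layer that prunes missing inputs instead of imputing them.
import numpy as np

def pruned_linear(x, W, b):
    # x: (n_features,) with NaN for missing; W: (n_out, n_features)
    present = ~np.isnan(x)
    x0 = np.where(present, x, 0.0)
    scale = x.size / max(present.sum(), 1)   # compensate for pruned terms
    return W @ x0 * scale + b

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
x = np.array([0.5, np.nan, -1.2, np.nan, 2.0])
h = pruned_linear(x, W, b)                   # forward pass, no imputation
```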
arXiv Detail & Related papers (2022-06-03T15:37:27Z)
- BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines [49.75878234192369]
We present WEAVER, a simple, yet efficient post-processing method that infuses old knowledge into the new model.
We show that applying WEAVER sequentially yields word embedding distributions similar to those obtained by training on all data at once.
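The core operation is simple enough to sketch: blend the previous model's parameters into the newly trained one instead of retraining on everything. The blending coefficient alpha is an assumption; the paper's exact averaging scheme may differ.

```python
# Post-hoc weight averaging of two model checkpoints.
import torch

def weight_average(old_state: dict, new_state: dict, alpha: float = 0.5) -> dict:
    return {k: alpha * old_state[k] + (1 - alpha) * new_state[k] for k in old_state}

old = {"w": torch.tensor([1.0, 2.0])}
new = {"w": torch.tensor([3.0, 4.0])}
merged = weight_average(old, new)   # load via model.load_state_dict(merged)
```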
arXiv Detail & Related papers (2022-02-21T10:34:41Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the Importance-Guided Stochastic Gradient Descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
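A heavily reduced sketch of the flavor of this approach: the model trains directly on inputs whose missing entries are merely masked to zero, while a scalar gate, updated bandit-style from a simple reward (did the loss improve?), rescales the back-propagated gradient. The reward and gating scheme are assumptions, not the IGSGD algorithm.

```python
# Imputation-free logistic regression with a reward-adjusted gradient gate.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5)); X[rng.random(X.shape) < 0.3] = np.nan
y = rng.integers(0, 2, size=64)
Xz = np.nan_to_num(X)                # missing entries masked to 0, not imputed
w, gate, prev_loss = np.zeros(5), 1.0, np.inf

for step in range(50):
    p = 1 / (1 + np.exp(-Xz @ w))
    grad = Xz.T @ (p - y) / len(y)   # standard logistic-regression gradient
    w -= 0.1 * gate * grad           # gate rescales the update
    p_new = 1 / (1 + np.exp(-Xz @ w))
    loss = -np.mean(y * np.log(p_new + 1e-9) + (1 - y) * np.log(1 - p_new + 1e-9))
    gate = min(gate * 1.05, 5.0) if loss < prev_loss else gate * 0.9
    prev_loss = loss
```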
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback [0.0]
We propose a Multiple Imputation model using Denoising Autoencoders to learn the internal representation of data.
We use the novel mechanisms of Metamorphic Truth and Imputation Feedback to maintain statistical integrity of attributes.
Our approach explores the effects of imputation on various missingness mechanisms and patterns of missing data, outperforming other methods in many standard test cases.
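A compact sketch of the loop, assuming an off-the-shelf MLP as the denoiser: start from mean-filled data, train the denoiser to reconstruct the current completion from noisy copies, write its outputs back into the missing cells, and repeat (the imputation feedback). The Metamorphic Truth mechanism itself is not reproduced here.

```python
# Denoising-autoencoder-style imputation with an imputation feedback loop.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 6))
mask = rng.random(X_true.shape) < 0.2              # True where value is missing
X_obs = np.where(mask, np.nan, X_true)

filled = np.where(mask, np.nanmean(X_obs, axis=0), X_obs)   # initial mean fill
for _ in range(3):                                  # imputation feedback rounds
    noisy = filled + rng.normal(scale=0.1, size=filled.shape)
    dae = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    dae.fit(noisy, filled)                          # denoise back to current data
    filled[mask] = dae.predict(filled)[mask]        # write imputations back in
```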
arXiv Detail & Related papers (2020-02-19T18:26:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.