In-Database Data Imputation
- URL: http://arxiv.org/abs/2401.03359v1
- Date: Sun, 7 Jan 2024 01:57:41 GMT
- Title: In-Database Data Imputation
- Authors: Massimo Perini, Milos Nikolic
- Abstract summary: Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making.
Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates, are computationally efficient but may introduce bias and disrupt variable relationships.
Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time.
This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method.
- Score: 0.6157028677798809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Missing data is a widespread problem in many domains, creating challenges in
data analysis and decision making. Traditional techniques for dealing with
missing data, such as excluding incomplete records or imputing simple estimates
(e.g., mean), are computationally efficient but may introduce bias and disrupt
variable relationships, leading to inaccurate analyses. Model-based imputation
techniques offer a more robust solution that preserves the variability and
relationships in the data, but they demand significantly more computation time,
limiting their applicability to small datasets.
This work enables efficient, high-quality, and scalable data imputation
within a database system using the widely used MICE method. We adapt this
method to exploit computation sharing and a ring abstraction for faster model
training. To impute both continuous and categorical values, we develop
techniques for in-database learning of stochastic linear regression and
Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL
and DuckDB outperform alternative MICE implementations and model-based
imputation techniques by up to two orders of magnitude in terms of computation
time, while maintaining high imputation quality.
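As a rough illustration of the MICE loop with stochastic linear regression mentioned in the abstract, here is a minimal NumPy sketch. It is not the authors' in-database implementation: it runs in memory, handles continuous columns only, omits the computation sharing and ring abstraction, and adds Gaussian noise to predictions without also sampling the regression coefficients, as a full MICE implementation would. The function name `mice_impute`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def mice_impute(X, n_iter=5, seed=0):
    """Simplified chained-equations imputation; NaN marks a missing value."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)

    # Initial fill: replace each missing cell with its column mean.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])

    n, d = X.shape
    for _ in range(n_iter):
        for j in range(d):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            obs_j = ~miss_j
            others = np.delete(np.arange(d), j)
            # Design matrix: intercept plus all other (currently filled) columns.
            A = np.column_stack([np.ones(n), X[:, others]])
            # Fit least squares on rows where column j was actually observed.
            beta, *_ = np.linalg.lstsq(A[obs_j], X[obs_j, j], rcond=None)
            resid = X[obs_j, j] - A[obs_j] @ beta
            dof = max(int(obs_j.sum()) - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Stochastic step: prediction plus noise drawn from the residual
            # distribution, so imputed values keep some variability.
            noise = rng.normal(0.0, sigma, int(miss_j.sum()))
            X[miss_j, j] = A[miss_j] @ beta + noise
    return X

# Toy data (assumed for illustration): two correlated columns,
# with roughly 20% of the second column removed.
demo_rng = np.random.default_rng(1)
x = demo_rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + demo_rng.normal(scale=0.1, size=200)])
data_missing = data.copy()
data_missing[demo_rng.random(200) < 0.2, 1] = np.nan

completed = mice_impute(data_missing)
```

Observed cells are left untouched; only cells that were NaN in the input are overwritten on each pass. Adding residual noise, rather than imputing the bare prediction, is what distinguishes this from deterministic regression imputation and helps preserve the variance of the imputed column.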
Related papers
- Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets [1.02138250640885]
"Not Another Imputation Method" (NAIM) is a transformer-based model designed to handle missing values without traditional imputation techniques.
NAIM employs feature-specific embeddings and a masked self-attention mechanism that effectively learns from available data.
We extensively evaluated NAIM on 5 publicly available datasets.
(2024-07-16T09:43:47Z)
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [67.28273187033693]
We show that training a network that directly predicts the desired output, known as amortization, is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
(2024-01-29T03:42:37Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
(2024-01-12T22:51:48Z)
- Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series [49.992908221544624]
Time series data often contain numerous missing values, and estimating them is the time series imputation task.
Previous deep learning methods have been shown to be effective for time series imputation.
We propose a non-generative time series imputation method that produces accurate imputations with inherent uncertainty.
(2023-12-03T05:52:30Z)
- Stabilizing Subject Transfer in EEG Classification with Divergence Estimation [17.924276728038304]
We propose several graphical models to describe an EEG classification task.
We identify statistical relationships that should hold true in an idealized training scenario.
We design regularization penalties to enforce these relationships in two stages.
(2023-10-12T23:06:52Z)
- Diffusion models for missing value imputation in tabular data [10.599563005836066]
Missing value imputation in machine learning is the task of estimating the missing values in the dataset accurately using available information.
We propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDI_T).
To effectively handle categorical variables and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization.
(2022-10-31T08:13:26Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
(2022-06-15T19:10:35Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic data and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
(2021-11-04T22:38:18Z)
- Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records [6.711824170437793]
We apply a recently developed multi-level computational optimization approach to the problem of imputation in massive medical records.
Results show that the multi-level method significantly outperforms current approaches and is numerically robust.
(2021-10-19T01:14:08Z)
- A computational study on imputation methods for missing environmental data [0.0]
This paper focuses on databases collecting information related to the natural environment.
It investigates the performance of several missing data imputation methods and their application to the problem of missing environmental data.
We believe that the present study demonstrates the pertinence of using MF as an imputation method when dealing with missing environmental data.
(2021-08-21T12:19:42Z)
- A Hypergradient Approach to Robust Regression without Correspondence [85.49775273716503]
We consider a variant of the regression problem in which the correspondence between input and output data is not available.
Most existing methods are only applicable when the sample size is small.
We propose a new computational framework -- ROBOT -- for the shuffled regression problem.
(2020-11-30T21:47:38Z)