Deep Imputation of Missing Values in Time Series Health Data: A Review with Benchmarking
- URL: http://arxiv.org/abs/2302.10902v2
- Date: Tue, 16 May 2023 16:56:04 GMT
- Title: Deep Imputation of Missing Values in Time Series Health Data: A Review with Benchmarking
- Authors: Maksims Kazijevs and Manar D. Samad
- Abstract summary: This survey performs six data-centric experiments to benchmark state-of-the-art deep imputation methods on five time series health data sets.
Deep learning methods that jointly perform cross-sectional (across variables) and longitudinal (across time) imputations of missing values in time series data yield statistically better data quality than traditional imputation methods.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The imputation of missing values in multivariate time series (MTS) data is
critical in ensuring data quality and producing reliable data-driven predictive
models. Apart from many statistical approaches, a few recent studies have
proposed state-of-the-art deep learning methods to impute missing values in MTS
data. However, the evaluation of these deep methods is limited to one or two
data sets, low missing rates, and completely random missing value types. This
survey performs six data-centric experiments to benchmark state-of-the-art deep
imputation methods on five time series health data sets. Our extensive analysis
reveals that no single imputation method outperforms the others on all five
data sets. The imputation performance depends on data types, individual
variable statistics, missing value rates, and types. Deep learning methods that
jointly perform cross-sectional (across variables) and longitudinal (across
time) imputations of missing values in time series data yield statistically
better data quality than traditional imputation methods. Although
computationally expensive, deep learning methods are practical given the
current availability of high-performance computing resources, especially when
data quality and sample size are highly important in healthcare informatics.
Our findings highlight the importance of data-centric selection of imputation
methods to optimize data-driven predictive models.
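As a rough, self-contained illustration of the data-centric experiments described above (a minimal sketch, not the authors' benchmark code), the snippet below hides observed entries of a toy multivariate time series completely at random, imputes them with a cross-sectional baseline (column mean) and a longitudinal baseline (forward fill), and scores both with RMSE on the held-out entries; the variable names, missing rates, and toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_mcar(X, rate):
    """Hide a fraction of entries completely at random (MCAR)."""
    mask = rng.random(X.shape) < rate          # True = artificially missing
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

def impute_mean(X):
    """Cross-sectional baseline: fill each variable with its column mean."""
    col_mean = np.nanmean(X, axis=0)
    out = X.copy()
    idx = np.where(np.isnan(out))
    out[idx] = np.take(col_mean, idx[1])
    return out

def impute_ffill(X):
    """Longitudinal baseline: carry the last observed value forward in time."""
    out = X.copy()
    for t in range(1, out.shape[0]):
        nan_t = np.isnan(out[t])
        out[t, nan_t] = out[t - 1, nan_t]
    # Any leading NaNs that cannot be forward-filled fall back to column means.
    return np.where(np.isnan(out), np.nanmean(X, axis=0), out)

def rmse(X_true, X_hat, mask):
    return np.sqrt(np.mean((X_true[mask] - X_hat[mask]) ** 2))

# Toy MTS: 200 time steps x 5 variables (stands in for a health data set).
X_true = np.cumsum(rng.normal(size=(200, 5)), axis=0)

for rate in (0.1, 0.3, 0.5):                   # sweep missing-value rates
    X_miss, mask = mask_mcar(X_true, rate)
    for name, fn in [("mean", impute_mean), ("ffill", impute_ffill)]:
        print(f"rate={rate:.1f}  {name:5s}  RMSE={rmse(X_true, fn(X_miss), mask):.3f}")
```

A deep imputer that models variables and time jointly would replace the two baselines in the inner loop; the masking, rate sweep, and held-out RMSE scoring stay the same.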
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- An End-to-End Model for Time Series Classification In the Presence of Missing Values [25.129396459385873]
Time series classification with missing data is a prevalent issue in time series analysis.
This study proposes an end-to-end neural network that unifies data imputation and representation learning within a single framework.
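The summary above gives no implementation details, so the following is only a generic sketch of what unifying imputation and representation learning in one network can look like: a shared recurrent encoder with an imputation head and a classification head trained on a weighted sum of both losses. All names, shapes, and the 0.5 weighting are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointImputeClassify(nn.Module):
    """Toy model with a shared encoder, an imputation head, and a class head."""
    def __init__(self, n_vars, hidden, n_classes):
        super().__init__()
        self.encoder = nn.GRU(input_size=2 * n_vars, hidden_size=hidden, batch_first=True)
        self.impute_head = nn.Linear(hidden, n_vars)    # reconstructs every time step
        self.class_head = nn.Linear(hidden, n_classes)  # classifies from the last state

    def forward(self, x, obs_mask):
        # Zero-fill missing entries and give the encoder the observation mask.
        inp = torch.cat([torch.nan_to_num(x), obs_mask.float()], dim=-1)
        h, _ = self.encoder(inp)                        # (batch, time, hidden)
        return self.impute_head(h), self.class_head(h[:, -1])

def joint_loss(x, obs_mask, x_hat, logits, y, alpha=0.5):
    # Reconstruction is scored only on observed entries; labels drive the class term.
    m = obs_mask.float()
    rec = (((x_hat - torch.nan_to_num(x)) ** 2) * m).sum() / m.sum()
    cls = nn.functional.cross_entropy(logits, y)
    return alpha * rec + (1 - alpha) * cls

# Toy batch: 8 series, 50 time steps, 5 variables, 3 classes, roughly 30% missing.
x = torch.randn(8, 50, 5)
obs_mask = torch.rand(8, 50, 5) > 0.3
x = x.masked_fill(~obs_mask, float("nan"))
y = torch.randint(0, 3, (8,))

model = JointImputeClassify(n_vars=5, hidden=32, n_classes=3)
x_hat, logits = model(x, obs_mask)
loss = joint_loss(x, obs_mask, x_hat, logits, y)
loss.backward()
print(float(loss))
```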
arXiv Detail & Related papers (2024-08-11T19:39:12Z)
- ITI-IQA: a Toolbox for Heterogeneous Univariate and Multivariate Missing Data Imputation Quality Assessment [0.0]
ITI-IQA is a set of utilities designed to assess the reliability of various imputation methods.
The toolbox also includes a suite of diagnostic methods and graphical tools for checking measurements.
arXiv Detail & Related papers (2024-07-16T14:26:46Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series [45.76310830281876]
We propose Quantile Sub-Ensembles, a novel method to estimate uncertainty with an ensemble of quantile-regression-based task networks.
Our method not only produces accurate imputations that are robust to high missing rates, but is also computationally efficient thanks to the fast training of its non-generative model.
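Quantile regression underlies the uncertainty estimates mentioned above. As a minimal, generic sketch (not the paper's Quantile Sub-Ensembles implementation), the pinball loss below trains a separate output per quantile, and the spread between a low and a high quantile serves as an uncertainty interval around each imputed value; the toy numbers are illustrative.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: asymmetric penalty that targets the q-th quantile."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

# Toy example: one model (or ensemble member) predicts the 0.1, 0.5, and 0.9
# quantiles for a handful of masked-out values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
preds = {0.1: np.array([0.5, 1.4, 2.2, 3.1]),
         0.5: np.array([1.1, 2.1, 2.9, 4.2]),
         0.9: np.array([1.8, 2.9, 3.8, 5.0])}

for q, y_hat in preds.items():
    print(f"q={q}: loss={pinball_loss(y_true, y_hat, q):.3f}")

# The [0.1, 0.9] band gives a simple 80% uncertainty interval per imputed value.
print("interval widths:", preds[0.9] - preds[0.1])
```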
arXiv Detail & Related papers (2023-12-03T05:52:30Z)
- Development of a Neural Network-based Method for Improved Imputation of Missing Values in Time Series Data by Repurposing DataWig [1.8719295298860394]
Missing values occur often in time series data and present obstacles to successful analysis; they therefore need to be filled with alternative values, a process called imputation.
Although various approaches have been attempted for robust imputation of time series data, even the most advanced methods still face challenges.
I developed tsDataWig (time-series DataWig) by modifying DataWig, a neural network-based method that can process large datasets.
Unlike the original DataWig, tsDataWig can directly handle values of time variables and impute missing values in complex time series data.
arXiv Detail & Related papers (2023-08-18T15:53:40Z)
- Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques [9.400097064676991]
The proper handling of missing values is critical to delivering reliable estimates and decisions.
The increasing diversity and complexity of data have led many researchers to develop deep learning (DL)-based imputation techniques.
arXiv Detail & Related papers (2022-10-15T11:11:20Z)
- CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation [107.63407690972139]
Conditional Score-based Diffusion models for Imputation (CSDI) is a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data.
CSDI improves by 40-70% over existing probabilistic imputation methods on popular performance metrics.
In addition, CSDI reduces the error by 5-20% compared to state-of-the-art deterministic imputation methods.
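The conditioning idea behind CSDI can be illustrated without the diffusion machinery: during training, part of the observed entries is masked out to act as imputation targets while the rest is kept as the conditioning set. The sketch below shows only that self-supervised split as a hedged paraphrase of the general strategy, not CSDI's actual code; the function name and masking fraction are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_condition_target(x, obs_mask, target_frac=0.2):
    """Split observed entries into a conditioning set and imputation targets."""
    # Choose a random subset of *observed* entries to hide during training.
    target_mask = obs_mask & (rng.random(x.shape) < target_frac)
    cond_mask = obs_mask & ~target_mask
    x_cond = np.where(cond_mask, x, 0.0)        # the model sees only conditioning values
    return x_cond, cond_mask, target_mask

# Toy series: 10 time steps x 3 variables, with some genuinely missing entries.
x = rng.normal(size=(10, 3))
obs_mask = rng.random((10, 3)) > 0.2            # True = actually observed

x_cond, cond_mask, target_mask = split_condition_target(x, obs_mask)
print("conditioning entries:", int(cond_mask.sum()))
print("training targets    :", int(target_mask.sum()))
# A conditional generative imputer is trained to reconstruct x[target_mask]
# given x_cond and cond_mask, then applied at test time to the truly missing entries.
```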
arXiv Detail & Related papers (2021-07-07T22:20:24Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models that perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)