Related papers: A computational study on imputation methods for missing environmental data

A computational study on imputation methods for missing environmental data

URL: http://arxiv.org/abs/2108.09500v1
Date: Sat, 21 Aug 2021 12:19:42 GMT
Title: A computational study on imputation methods for missing environmental data
Authors: Paul Dixneuf and Fausto Errico and Mathias Glaus
Abstract summary: This paper focuses on databases collecting information related to the natural environment. It investigates the performances of several missing data imputation methods and their application to the problem of missing data in environment. We believe that the present study demonstrates the pertinence of using MF as imputation method when dealing with missing environmental data.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Data acquisition and recording in the form of databases are routine operations. The process of collecting data, however, may experience irregularities, resulting in databases with missing data. Missing entries might alter analysis efficiency and, consequently, the associated decision-making process. This paper focuses on databases collecting information related to the natural environment. Given the broad spectrum of recorded activities, these databases typically are of mixed nature. It is therefore relevant to evaluate the performance of missing data processing methods considering this characteristic. In this paper we investigate the performances of several missing data imputation methods and their application to the problem of missing data in environment. A computational study was performed to compare the method missForest (MF) with two other imputation methods, namely Multivariate Imputation by Chained Equations (MICE) and K-Nearest Neighbors (KNN). Tests were made on 10 pretreated datasets of various types. Results revealed that MF generally outperformed MICE and KNN in terms of imputation errors, with a more pronounced performance gap for mixed typed databases where MF reduced the imputation error up to 150%, when compared to the other methods. KNN was usually the fastest method. MF was then successfully applied to a case study on Quebec wastewater treatment plants performance monitoring. We believe that the present study demonstrates the pertinence of using MF as imputation method when dealing with missing environmental data.

Related papers

Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training. We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
Estimating Conditional Average Treatment Effects via Sufficient Representation Learning [31.822980052107496]
This paper proposes a novel neural network approach named textbfCrossNet to learn a sufficient representation for the features, based on which we then estimate the conditional average treatment effects (CATE) Numerical simulations and empirical results demonstrate that our method outperforms the competitive approaches.
arXiv Detail & Related papers (2024-08-30T07:23:59Z)
On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets [0.0]
Missing values or data is one popular characteristic of real-world datasets, especially healthcare data. This study is to compare the performance of seven imputation techniques, namely Mean imputation, Median Imputation, Last Observation carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, Missforest imputation, and Multiple imputation by Chained Equations (MICE) The results show that Missforest imputation performs the best followed by MICE imputation.
arXiv Detail & Related papers (2024-03-13T18:07:17Z)
Physics-informed and Unsupervised Riemannian Domain Adaptation for Machine Learning on Heterogeneous EEG Datasets [53.367212596352324]
We propose an unsupervised approach leveraging EEG signal physics. We map EEG channels to fixed positions using field, source-free domain adaptation. Our method demonstrates robust performance in brain-computer interface (BCI) tasks and potential biomarker applications.
arXiv Detail & Related papers (2024-03-07T16:17:33Z)
In-Database Data Imputation [0.6157028677798809]
Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates, are computationally efficient but may introduce bias and disrupt variable relationships. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time. This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method.
arXiv Detail & Related papers (2024-01-07T01:57:41Z)
IRTCI: Item Response Theory for Categorical Imputation [5.9952530228468754]
Several imputation techniques have been designed to replace missing data with stand in values. The work showcased here offers a novel means for categorical imputation based on item response theory (IRT) Analyses comparing these techniques were performed on three different datasets.
arXiv Detail & Related papers (2023-02-08T16:17:20Z)
Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data [9.50726756006467]
Imputation is arguably the most popular method for handling missing data, though existing methods have a number of limitations. We propose two NNGP-based MI methods, namely MI-NNGP, that can apply multiple imputations for missing values from a joint (posterior predictive) distribution. The MI-NNGP methods are shown to significantly outperform existing state-of-the-art methods on synthetic and real datasets.
arXiv Detail & Related papers (2022-11-23T20:54:26Z)
Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures are often exploited to improve the solution of learning tasks. Data augmentation induces these symmetries during training by applying multiple transformations to the input data. This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
arXiv Detail & Related papers (2022-09-29T18:11:01Z)
MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z)
To Impute or not to Impute? -- Missing Data in Treatment Effect Estimation [84.76186111434818]
We identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection. We show that naively imputing all data leads to poor performing treatment effects models, as the act of imputation effectively removes information necessary to provide unbiased estimates. Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not.
arXiv Detail & Related papers (2022-02-04T12:08:31Z)
MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data. MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism. We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare. In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples. To tackle this problem, we build a robust one-class classification framework via data refinement. We show that our method outperforms state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.