Impact of Comprehensive Data Preprocessing on Predictive Modelling of COVID-19 Mortality
- URL: http://arxiv.org/abs/2408.08142v1
- Date: Thu, 15 Aug 2024 13:23:59 GMT
- Title: Impact of Comprehensive Data Preprocessing on Predictive Modelling of COVID-19 Mortality
- Authors: Sangita Das, Subhrajyoti Maji
- Abstract summary: This study evaluates the impact of a custom data preprocessing pipeline on ten machine learning models predicting COVID-19 mortality.
Our pipeline differs from a standard preprocessing pipeline through four key steps.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate predictive models are crucial for analysing COVID-19 mortality trends. This study evaluates the impact of a custom data preprocessing pipeline on ten machine learning models predicting COVID-19 mortality using data from Our World in Data (OWID). Our pipeline differs from a standard preprocessing pipeline through four key steps. Firstly, it transforms weekly reported totals into daily updates, correcting reporting biases and providing more accurate estimates. Secondly, it uses localised outlier detection and processing to preserve data variance and enhance accuracy. Thirdly, it utilises computational dependencies among columns to ensure data consistency. Finally, it incorporates an iterative feature selection process to optimise the feature set and improve model performance. Results show a significant improvement with the custom pipeline: the MLP Regressor achieved a test RMSE of 66.556 and a test R-squared of 0.991, surpassing the DecisionTree Regressor from the standard pipeline, which had a test RMSE of 222.858 and a test R-squared of 0.817. These findings highlight the importance of tailored preprocessing techniques in enhancing predictive modelling accuracy for COVID-19 mortality. Although specific to this study, these methodologies offer valuable insights into diverse datasets and domains, improving predictive performance across various contexts.
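The four preprocessing steps described in the abstract lend themselves to a compact illustration. The following is a minimal sketch only, not the authors' implementation: the column names ("new_deaths", "total_deaths"), the window sizes, and the use of RFECV as a stand-in for the iterative feature-selection loop are all assumptions made for the example.

```python
# Minimal sketch of the four preprocessing steps described in the abstract.
# All column names, window sizes, and estimators are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression


def weekly_to_daily(weekly: pd.Series) -> pd.Series:
    """Step 1: spread each weekly-reported total evenly over the preceding week."""
    daily = pd.Series(0.0, index=weekly.index)
    for report_date, total in weekly.dropna().items():
        window = pd.date_range(end=report_date, periods=7, freq="D").intersection(daily.index)
        if len(window):
            daily.loc[window] += total / len(window)
    return daily


def smooth_local_outliers(s: pd.Series, window: int = 7, z: float = 3.0) -> pd.Series:
    """Step 2: localised outlier processing with a rolling median/MAD, so only
    points far from their neighbourhood are replaced and overall variance is kept."""
    med = s.rolling(window, center=True, min_periods=1).median()
    mad = (s - med).abs().rolling(window, center=True, min_periods=1).median()
    score = (s - med) / (1.4826 * mad.replace(0, np.nan))
    return s.mask(score.abs() > z, med)


def enforce_dependencies(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: re-derive computationally dependent columns (here, cumulative deaths
    from daily deaths) so that related columns stay mutually consistent."""
    out = df.copy()
    out["total_deaths"] = out["new_deaths"].cumsum()
    return out


def select_features(X: pd.DataFrame, y: pd.Series) -> list:
    """Step 4: iterative feature selection; RFECV stands in for the custom loop."""
    selector = RFECV(LinearRegression(), step=1, cv=3).fit(X, y)
    return list(X.columns[selector.support_])


# Toy usage: weekly dumps of 700 deaths become a smooth daily series of 100/day.
idx = pd.date_range("2021-01-01", periods=28, freq="D")
weekly = pd.Series(np.where(np.arange(28) % 7 == 6, 700.0, np.nan), index=idx)
daily = smooth_local_outliers(weekly_to_daily(weekly))
```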
Related papers
- Sustaining model performance for covid-19 detection from dynamic audio data: Development and evaluation of a comprehensive drift-adaptive framework [0.5679775668038152]
The COVID-19 pandemic has highlighted the need for robust diagnostic tools capable of detecting the disease from diverse and evolving data sources.
The dynamic nature of real-world data can lead to model drift, where performance degrades over time as the underlying data distribution changes.
This study aims to develop a framework that monitors model drift and employs adaptation mechanisms to mitigate performance fluctuations.
arXiv Detail & Related papers (2024-09-28T10:06:30Z)
- Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm [12.201705893125775]
We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit.
Applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy.
We create a benchmark to evaluate estimator accuracy using synthetic outcomes.
arXiv Detail & Related papers (2024-09-06T15:44:45Z)
- A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle for estimating model performance, leads to large errors when fitted on small datasets.
We introduce a novel piecewise power law (PPL) that handles the small-data and large-data regimes differently (a generic sketch follows this entry).
arXiv Detail & Related papers (2023-03-02T21:48:22Z)
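The entry above describes fitting learning-curve data in two regimes. The sketch below is a generic piecewise power law fitted by scanning candidate break points; the parameterisation, break-point search, and toy data are assumptions for illustration, not the PPL formulation from the paper.

```python
# Hedged sketch: fit error-vs-dataset-size data with two power-law segments.
import numpy as np


def fit_power_law(n, err):
    """Fit err ~ a * n**b by linear regression in log-log space."""
    b, log_a = np.polyfit(np.log(n), np.log(err), 1)
    return np.exp(log_a), b


def fit_piecewise_power_law(n, err):
    """Try every interior break point and keep the split with the lowest
    squared error in log space (small-data vs. large-data regimes)."""
    n, err = np.asarray(n, float), np.asarray(err, float)
    best = None
    for k in range(2, len(n) - 2):
        (a1, b1), (a2, b2) = fit_power_law(n[:k], err[:k]), fit_power_law(n[k:], err[k:])
        pred = np.concatenate([a1 * n[:k] ** b1, a2 * n[k:] ** b2])
        loss = np.sum((np.log(err) - np.log(pred)) ** 2)
        if best is None or loss < best[0]:
            best = (loss, n[k], (a1, b1), (a2, b2))
    return best[1:]  # break point and the two (scale, exponent) pairs


# Toy learning curve: noisy small-data regime, clean power-law tail.
sizes = np.array([50, 100, 200, 400, 800, 1600, 3200, 6400])
errors = 5.0 * sizes ** -0.3 + np.where(sizes < 400, 0.5, 0.0)
print(fit_piecewise_power_law(sizes, errors))
```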
- Prediction of SLAM ATE Using an Ensemble Learning Regression Model and 1-D Global Pooling of Data Characterization [3.4399698738841553]
We introduce a novel method for predicting SLAM localization error based on the characterization of raw sensor inputs.
The proposed method relies on a random forest regression model trained on 1-D global pooled features generated from characterised raw sensor data.
The paper also studies the impact of 12 different 1-D global pooling functions on regression quality, quantitatively demonstrating the advantage of 1-D global averaging (a small illustration follows this entry).
arXiv Detail & Related papers (2023-03-01T16:12:47Z)
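As referenced above, a small illustration of 1-D global pooling feeding a random-forest regressor follows. The four pooling functions shown and the synthetic signals and targets are assumptions for the example; the paper studies 12 pooling functions and characterises real SLAM sensor data.

```python
# Illustrative sketch: 1-D global pooling of variable-length signals followed by
# a random-forest regressor. Data and targets below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

POOLS = {"mean": np.mean, "max": np.max, "std": np.std, "median": np.median}


def global_pool(signal: np.ndarray) -> np.ndarray:
    """Collapse a 1-D signal of any length into one value per pooling function."""
    return np.array([fn(signal) for fn in POOLS.values()])


# Toy data: each sample is a variable-length sensor trace; the target mimics an
# error measure that grows with signal roughness.
rng = np.random.default_rng(0)
signals = [rng.normal(scale=s, size=rng.integers(100, 200)) for s in rng.uniform(0.1, 2.0, 64)]
X = np.stack([global_pool(s) for s in signals])
y = np.array([np.std(np.diff(s)) for s in signals])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("train R^2:", round(model.score(X, y), 3))
```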
- Learning brain MRI quality control: a multi-factorial generalization problem [0.0]
This work aimed at evaluating the performances of the MRIQC pipeline on various large-scale datasets.
We focused our analysis on the MRIQC preprocessing steps and tested the pipeline with and without them.
We concluded that a model trained with data from a heterogeneous population, such as the CATI dataset, provides the best scores on unseen data.
arXiv Detail & Related papers (2022-05-31T15:46:44Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold (see the sketch after this entry).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
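A minimal sketch of the ATC idea summarised above: calibrate a confidence threshold on labelled source data, then report the fraction of unlabelled target examples above it. The use of the max-softmax probability as the confidence score and the synthetic softmax outputs are assumptions for illustration; the paper also considers other scores.

```python
# Hedged sketch of Average Thresholded Confidence (ATC)-style accuracy estimation.
import numpy as np


def learn_atc_threshold(source_probs: np.ndarray, source_labels: np.ndarray) -> float:
    """Pick the threshold so the share of source points above it matches source accuracy."""
    conf = source_probs.max(axis=1)
    acc = float(np.mean(source_probs.argmax(axis=1) == source_labels))
    return float(np.quantile(conf, 1.0 - acc))


def predict_target_accuracy(target_probs: np.ndarray, threshold: float) -> float:
    """Estimated accuracy = fraction of unlabelled target examples above the threshold."""
    return float(np.mean(target_probs.max(axis=1) >= threshold))


# Toy usage with synthetic softmax outputs for a 3-class problem.
rng = np.random.default_rng(0)
src = rng.dirichlet(np.ones(3) * 0.5, size=1000)
src_labels = rng.integers(0, 3, size=1000)
tgt = rng.dirichlet(np.ones(3) * 2.0, size=1000)  # flatter, "shifted" predictions
t = learn_atc_threshold(src, src_labels)
print("predicted target accuracy:", predict_target_accuracy(tgt, t))
```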
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can reach final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task.
The 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature.
We formulate a novel problem and neural framework Back2Future that aims to refine a given model's predictions in real-time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate a simple method we call prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift (see the sketch after this entry).
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
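A minimal PyTorch sketch of the prediction-time batch normalization idea summarised above: at inference, BatchNorm layers recompute statistics from the current test batch instead of using the running averages collected during training. The toy model, the simulated shift, and the choice of framework are assumptions for illustration.

```python
# Sketch: run inference with BatchNorm layers using the test batch's statistics.
import torch
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


def predict_with_batch_stats(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Evaluate the model, but let BatchNorm layers normalise with this batch's
    mean/variance (note: their running statistics are also updated as a side effect)."""
    model.eval()                      # dropout etc. stay in eval mode
    for m in model.modules():
        if isinstance(m, BN_TYPES):
            m.train()                 # switch BN to batch statistics
    with torch.no_grad():
        out = model(x)
    model.eval()                      # restore standard inference behaviour
    return out


model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 2))
shifted_batch = torch.randn(64, 16) * 3.0 + 1.0   # crude stand-in for covariate shift
logits = predict_with_batch_stats(model, shifted_batch)
print(logits.shape)  # torch.Size([64, 2])
```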
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.