Machine Learning Performance Analysis to Predict Stroke Based on
Imbalanced Medical Dataset
- URL: http://arxiv.org/abs/2211.07652v1
- Date: Mon, 14 Nov 2022 17:36:46 GMT
- Title: Machine Learning Performance Analysis to Predict Stroke Based on
Imbalanced Medical Dataset
- Authors: Yuru Jing
- Abstract summary: Cerebral stroke, the second leading cause of death worldwide, has been a primary public health concern over the last few years.
Medical datasets are frequently imbalanced in their class labels, so models tend to predict minority classes poorly.
In this paper, the potential risk factors for stroke are investigated.
Four distinct approaches are applied to improve the classification of the minority class in the imbalanced stroke dataset.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cerebral stroke, the second leading cause of death worldwide, has
been a primary public health concern over the last few years. Machine learning
techniques make early detection of stroke warning signs possible, which can
help prevent a stroke or reduce its severity. Medical datasets, however, are
frequently imbalanced in their class labels, so models tend to predict minority
classes poorly. In this paper, the potential risk factors for stroke are
investigated. Moreover, four distinct approaches are applied to improve the
classification of the minority class in the imbalanced stroke dataset: the
ensemble weighted voting classifier, the Synthetic Minority Over-sampling
Technique (SMOTE), Principal Component Analysis with K-Means Clustering
(PCA-Kmeans), and Focal Loss with a Deep Neural Network (DNN). Their
performance is then compared. According to the analysis, SMOTE and PCA-Kmeans
with DNN-Focal Loss work best for a limited-size dataset with a large, severe
class imbalance, performing 2-4 times better than the Kaggle work.
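Two of the techniques named in the abstract, focal loss and SMOTE, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `binary_focal_loss` and `smote_like_oversample` are hypothetical, the defaults `gamma=2.0` and `alpha=0.25` are commonly used values rather than settings taken from the paper, and the oversampler is a simplified version of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours).

```python
import numpy as np


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    # p_t is the predicted probability assigned to the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    # alpha_t weights the positive class by alpha, the negative by 1 - alpha.
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)**gamma down-weights examples that are already well classified.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Simplified sketch: needs at least two minority samples, uses brute-force
    Euclidean neighbours, and handles numeric features only.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_min, dtype=float)
    n = len(X)
    k = min(k, n - 1)
    # Pairwise distances; the diagonal is masked so a point is not its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place each synthetic point at a random position on the connecting segment.
    lam = rng.random((n_new, 1))
    return X[base] + lam * (X[chosen] - X[base])


if __name__ == "__main__":
    # Toy data only; these numbers are made up for illustration.
    minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(smote_like_oversample(minority, n_new=3, k=2))
    print(binary_focal_loss([1, 0, 1], [0.9, 0.1, 0.3]))
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the ordinary binary cross-entropy, which is a convenient sanity check. In practice a maintained library implementation is usually preferable to a hand-written one.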
Related papers
- Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- MCRAGE: Synthetic Healthcare Data for Fairness [3.0089659534785853]
We propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE) to augment imbalanced datasets.
MCRAGE involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes.
We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes.
arXiv Detail & Related papers (2023-10-27T19:02:22Z)
- Investigating Group Distributionally Robust Optimization for Deep
Imbalanced Learning: A Case Study of Binary Tabular Data Classification [0.44040106718326594]
Group distributionally robust optimization (gDRO) is investigated in this study for imbalance learning.
Experimental findings in comparison with empirical risk minimization (ERM) and classical imbalance methods reveal impressive performance of gDRO.
arXiv Detail & Related papers (2023-03-04T21:20:58Z)
- Ischemic Stroke Lesion Prediction using imbalanced Temporal Deep
Gaussian Process (iTDGP) [2.649401887836554]
Acute Ischemic Stroke (AIS) occurs when the blood supply to the brain is suddenly interrupted because of a blocked artery.
The current standard AIS assessment method, which thresholds the 3D measurement maps extracted from Computed Tomography Perfusion (CTP) images, is not accurate enough.
We propose imbalanced Temporal Deep Gaussian Process (iTDGP), a probabilistic model that can improve AIS prediction by using baseline CTP time series.
arXiv Detail & Related papers (2022-11-16T17:32:29Z)
- RoS-KD: A Robust Stochastic Knowledge Distillation Approach for Noisy
Medical Imaging [67.02500668641831]
Deep learning models trained on noisy datasets are sensitive to the noise type and lead to less generalization on unseen samples.
We propose a Robust Stochastic Knowledge Distillation (RoS-KD) framework which mimics the notion of learning a topic from multiple sources to ensure deterrence in learning noisy information.
RoS-KD learns a smooth, well-informed, and robust student manifold by distilling knowledge from multiple teachers trained on overlapping subsets of training data.
arXiv Detail & Related papers (2022-10-15T22:32:20Z)
- Density-Aware Personalized Training for Risk Prediction in Imbalanced
Medical Data [89.79617468457393]
Training models with imbalance rate (class density discrepancy) may lead to suboptimal prediction.
We propose a framework for training models for this imbalance issue.
We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z)
- A predictive analytics approach for stroke prediction using machine
learning and neural networks [4.984181486695979]
This paper systematically analyzes the various factors in electronic health records for effective stroke prediction.
Age, heart disease, average glucose level, and hypertension are the most important factors for detecting stroke in patients.
A perceptron neural network using these four attributes provides the highest accuracy rate and lowest miss rate.
arXiv Detail & Related papers (2022-03-01T14:45:15Z)
- Cross-Site Severity Assessment of COVID-19 from CT Images via Domain
Adaptation [64.59521853145368]
Early and accurate severity assessment of Coronavirus disease 2019 (COVID-19) based on computed tomography (CT) images is of great help in estimating intensive care unit events.
To augment the labeled data and improve the generalization ability of the classification model, it is necessary to aggregate data from multiple sites.
This task faces several challenges including class imbalance between mild and severe infections, domain distribution discrepancy between sites, and presence of heterogeneous features.
arXiv Detail & Related papers (2021-09-08T07:56:51Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making aimed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.