Application of data engineering approaches to address challenges in
microbiome data for optimal medical decision-making
- URL: http://arxiv.org/abs/2307.00033v2
- Date: Tue, 11 Jul 2023 11:01:42 GMT
- Title: Application of data engineering approaches to address challenges in
microbiome data for optimal medical decision-making
- Authors: Isha Thombre, Pavan Kumar Perepu, Shyam Kumar Sudhakar
- Abstract summary: The study addresses the issues inherent to microbiome datasets and could be highly beneficial for providing personalized medicine.
The prototype employed in the study addresses the issues inherent to microbiome datasets and could be highly beneficial for providing personalized medicine.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The human gut microbiota is known to contribute to numerous physiological
functions of the body and also implicated in a myriad of pathological
conditions. Prolific research work in the past few decades have yielded
valuable information regarding the relative taxonomic distribution of gut
microbiota. Unfortunately, the microbiome data suffers from class imbalance and
high dimensionality issues that must be addressed. In this study, we have
implemented data engineering algorithms to address the above-mentioned issues
inherent to microbiome data. Four standard machine learning classifiers
(logistic regression (LR), support vector machines (SVM), random forests (RF),
and extreme gradient boosting (XGB) decision trees) were implemented on a
previously published dataset. The issue of class imbalance and high
dimensionality of the data was addressed through synthetic minority
oversampling technique (SMOTE) and principal component analysis (PCA). Our
results indicate that ensemble classifiers (RF and XGB decision trees) exhibit
superior classification accuracy in predicting the host phenotype. The
application of PCA significantly reduced testing time while maintaining high
classification accuracy. The highest classification accuracy was obtained at
the levels of species for most classifiers. The prototype employed in the study
addresses the issues inherent to microbiome datasets and could be highly
beneficial for providing personalized medicine.
Related papers
- Artificial Data Point Generation in Clustered Latent Space for Small
Medical Datasets [4.542616945567623]
This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL)
AGCL is designed to enhance classification performance on small medical datasets through synthetic data generation.
It was applied to Parkinson's disease screening, utilizing facial expression data.
arXiv Detail & Related papers (2024-09-26T09:51:08Z) - ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs)
Our proposed method first trains SOMs on unlabeled data and then a minimal number of available labeled data points are assigned to key best matching units (BMU)
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - Human Limits in Machine Learning: Prediction of Plant Phenotypes Using
Soil Microbiome Data [0.2812395851874055]
We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes.
We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models.
arXiv Detail & Related papers (2023-06-19T20:52:37Z) - Label scarcity in biomedicine: Data-rich latent factor discovery
enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performances gains from semisupervison approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z) - Deep neural networks approach to microbial colony detection -- a
comparative analysis [52.77024349608834]
This study investigates the performance of three deep learning approaches for object detection on the AGAR dataset.
The achieved results may serve as a benchmark for future experiments.
arXiv Detail & Related papers (2021-08-23T12:06:00Z) - Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling.
We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in
Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z) - Mycorrhiza: Genotype Assignment usingPhylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.