Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification
- URL: http://arxiv.org/abs/2207.08898v1
- Date: Mon, 18 Jul 2022 19:16:56 GMT
- Title: Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification
- Authors: Sarwan Ali, Bikram Sahoo, Alexander Zelikovskiy, Pin-Yu Chen, Murray
Patterson
- Abstract summary: We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
- Score: 109.81283748940696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid spread of the COVID-19 pandemic has resulted in an unprecedented
amount of sequence data of the SARS-CoV-2 genome -- millions of sequences and
counting. This amount of data, while being orders of magnitude beyond the
capacity of traditional approaches to understanding the diversity, dynamics,
and evolution of viruses is nonetheless a rich resource for machine learning
(ML) approaches as alternatives for extracting such important information from
these data. It is of hence utmost importance to design a framework for testing
and benchmarking the robustness of these ML models.
This paper makes the first effort (to our knowledge) to benchmark the
robustness of ML models by simulating biological sequences with errors. In this
paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to
mimic the error profiles of common sequencing platforms such as Illumina and
PacBio. We show from experiments on a wide array of ML models that some
simulation-based approaches are more robust (and accurate) than others for
specific embedding methods to certain adversarial attacks to the input
sequences. Our benchmarking framework may assist researchers in properly
assessing different ML models and help them understand the behavior of the
SARS-CoV-2 virus or avoid possible future pandemics.
Related papers
- Predictive Analytics of Varieties of Potatoes [2.336821989135698]
We explore the application of machine learning algorithms specifically to enhance the selection process of Russet potato clones in breeding trials.
This study addresses the challenge of efficiently identifying high-yield, disease-resistant, and climate-resilient potato varieties.
arXiv Detail & Related papers (2024-04-04T00:49:05Z) - AI enhanced data assimilation and uncertainty quantification applied to
Geological Carbon Storage [0.0]
We introduce the Surrogate-based hybrid ESMDA (SH-ESMDA), an adaptation of the traditional Ensemble Smoother with Multiple Data Assimilation (ESMDA)
We also introduce Surrogate-based Hybrid RML (SH-RML), a variational data assimilation approach that relies on the randomized maximum likelihood (RML)
Our comparative analyses show that SH-RML offers better uncertainty compared to conventional ESMDA for the case study.
arXiv Detail & Related papers (2024-02-09T00:24:46Z) - Building Robust Machine Learning Models for Small Chemical Science Data:
The Case of Shear Viscosity [3.4761212729163313]
We train several Machine Learning models to predict the shear viscosity of a Lennard-Jones (LJ) fluid.
Specifically, the issues related to model selection, performance estimation and uncertainty quantification were investigated.
arXiv Detail & Related papers (2022-08-23T07:33:14Z) - Modelling COVID-19 Pandemic Dynamics Using Transparent, Interpretable,
Parsimonious and Simulatable (TIPS) Machine Learning Models: A Case Study
from Systems Thinking and System Identification Perspectives [1.4061680807550718]
The present study proposes using systems engineering and system identification approach to build transparent, interpretable, parsimonious and simulatable (TIPS) dynamic machine learning models.
TIPS models are developed based on the well-known NARMAX (Nonlinear AutoRegressive Moving Average with eXogenous inputs) model, which can help better understand the COVID-19 pandemic dynamics.
arXiv Detail & Related papers (2021-11-01T08:42:37Z) - Non-stationary Gaussian process discriminant analysis with variable
selection for high-dimensional functional data [0.0]
High-dimensional classification and feature selection are ubiquitous with the recent advancement in data acquisition technology.
These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately.
We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework.
arXiv Detail & Related papers (2021-09-29T03:35:49Z) - A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC)
arXiv Detail & Related papers (2021-08-07T15:08:15Z) - Learning Gaussian Mixtures with Generalised Linear Models: Precise
Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exacts characterising the estimator in high-dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Two-step penalised logistic regression for multi-omic data with an
application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z) - A Systematic Approach to Featurization for Cancer Drug Sensitivity
Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.