Related papers: Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification

Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification

URL: http://arxiv.org/abs/2207.08898v1
Date: Mon, 18 Jul 2022 19:16:56 GMT
Title: Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification
Authors: Sarwan Ali, Bikram Sahoo, Alexander Zelikovskiy, Pin-Yu Chen, Murray Patterson
Abstract summary: We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
Score: 109.81283748940696
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome -- millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is of hence utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.

Related papers

Comprehensive Benchmarking of Machine Learning Methods for Risk Prediction Modelling from Large-Scale Survival Data: A UK Biobank Study [0.0]
Large-scale prospective cohort studies and a diverse toolkit of available machine learning (ML) algorithms have facilitated such survival task efforts. We sought to benchmark eight distinct survival task implementations, ranging from linear to deep learning (DL) models. We assessed how well different architectures scale with sample sizes ranging from n = 5,000 to n = 250,000 individuals.
arXiv Detail & Related papers (2025-03-11T20:27:20Z)
A Robust Support Vector Machine Approach for Raman COVID-19 Data Classification [0.7864304771129751]
In this paper, we investigate the performance of a novel robust formulation for Support Vector Machine (SVM) in classifying COVID-19 samples obtained from Raman spectroscopy. We derive robust counterpart models of deterministic formulations using bounded-by-norm uncertainty sets around each observation. The effectiveness of our approach is validated on real-world COVID-19 datasets provided by Italian hospitals.
arXiv Detail & Related papers (2025-01-29T14:02:45Z)
Predictive Analytics of Varieties of Potatoes [2.336821989135698]
We explore the application of machine learning algorithms specifically to enhance the selection process of Russet potato clones in breeding trials. This study addresses the challenge of efficiently identifying high-yield, disease-resistant, and climate-resilient potato varieties.
arXiv Detail & Related papers (2024-04-04T00:49:05Z)
AI enhanced data assimilation and uncertainty quantification applied to Geological Carbon Storage [0.0]
We introduce the Surrogate-based hybrid ESMDA (SH-ESMDA), an adaptation of the traditional Ensemble Smoother with Multiple Data Assimilation (ESMDA) We also introduce Surrogate-based Hybrid RML (SH-RML), a variational data assimilation approach that relies on the randomized maximum likelihood (RML) Our comparative analyses show that SH-RML offers better uncertainty compared to conventional ESMDA for the case study.
arXiv Detail & Related papers (2024-02-09T00:24:46Z)
Building Robust Machine Learning Models for Small Chemical Science Data: The Case of Shear Viscosity [3.4761212729163313]
We train several Machine Learning models to predict the shear viscosity of a Lennard-Jones (LJ) fluid. Specifically, the issues related to model selection, performance estimation and uncertainty quantification were investigated.
arXiv Detail & Related papers (2022-08-23T07:33:14Z)
Modelling COVID-19 Pandemic Dynamics Using Transparent, Interpretable, Parsimonious and Simulatable (TIPS) Machine Learning Models: A Case Study from Systems Thinking and System Identification Perspectives [1.4061680807550718]
The present study proposes using systems engineering and system identification approach to build transparent, interpretable, parsimonious and simulatable (TIPS) dynamic machine learning models. TIPS models are developed based on the well-known NARMAX (Nonlinear AutoRegressive Moving Average with eXogenous inputs) model, which can help better understand the COVID-19 pandemic dynamics.
arXiv Detail & Related papers (2021-11-01T08:42:37Z)
Non-stationary Gaussian process discriminant analysis with variable selection for high-dimensional functional data [0.0]
High-dimensional classification and feature selection are ubiquitous with the recent advancement in data acquisition technology. These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately. We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework.
arXiv Detail & Related papers (2021-09-29T03:35:49Z)
A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC)
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. We prove exacts characterising the estimator in high-dimensions via empirical risk minimisation. We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately. Our approach should be preferred if the goal is to select as many relevant predictors as possible. Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques. We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.