MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving
Protein Populations
- URL: http://arxiv.org/abs/2008.11790v1
- Date: Wed, 26 Aug 2020 20:20:30 GMT
- Title: MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving
Protein Populations
- Authors: Daniel S. Berman (1), Craig Howser (1), Thomas Mehoke (1), Jared D.
Evans (1) ((1) Johns Hopkins Applied Physics Laboratory, Laurel, United
States)
- Abstract summary: Influenza virus sequences were identified as an ideal test case for this deep learning framework.
MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids.
Results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in evolutionary prediction for any protein population.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The ability to predict the evolution of a pathogen would significantly
improve the ability to control, prevent, and treat disease. Despite significant
progress in other problem spaces, deep learning has yet to contribute to the
issue of predicting mutations of evolving populations. To address this gap, we
developed a novel machine learning framework using generative adversarial
networks (GANs) with recurrent neural networks (RNNs) to accurately predict
genetic mutations and evolution of future biological populations. Using a
generalized time-reversible phylogenetic model of protein evolution with
bootstrapped maximum likelihood tree estimation, we trained a
sequence-to-sequence generator within an adversarial framework, named MutaGAN,
to generate complete protein sequences augmented with possible mutations of
future virus populations. Influenza virus sequences were identified as an ideal
test case for this deep learning framework because influenza is a significant
human pathogen with new strains emerging annually, and global surveillance
efforts have generated a large amount of publicly available data from the
National Center for Biotechnology Information's (NCBI) Influenza Virus Resource (IVR).
MutaGAN generated "child" sequences from a given "parent" protein sequence with
a median Levenshtein distance of 2.00 amino acids. Additionally, the generator
was able to augment the majority of parent proteins with at least one mutation
identified within the global influenza virus population. These results
demonstrate the power of the MutaGAN framework to aid in pathogen forecasting
with implications for broad utility in evolutionary prediction for any protein
population.
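
The abstract reports a median Levenshtein distance of 2.00 amino acids between parent and generated child sequences. As a rough illustration of how such a figure could be computed, here is a minimal sketch of the standard dynamic-programming edit distance applied to a few parent/child pairs. The function name `levenshtein`, the `pairs` list, and the toy sequence fragments are assumptions made up for this example; they are not the authors' code or real influenza data.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # Keep the shorter string as the inner dimension to limit row length.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical parent/child pairs (toy fragments, not real influenza data).
pairs = [
    ("MKTIIALSYIF", "MKTIIALSYIL"),  # one substitution
    ("MKTIIALSYIF", "MKAIIALSHIF"),  # two substitutions
    ("MKTIIALSYIF", "MKTIALSYIFC"),  # one deletion plus one insertion
]

distances = [levenshtein(parent, child) for parent, child in pairs]
median_distance = median(distances)
print(distances, median_distance)
```

A small median under this metric means a typical generated child differs from its parent by only a couple of amino-acid edits. The metric alone does not say whether those edits match mutations observed in the real population, which the abstract reports separately.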
Related papers
- MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training [48.398329286769304]
Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families.
MSAGPT is a novel approach to prompt protein structure predictions via MSA generative pretraining in the low MSA regime.
arXiv Detail & Related papers (2024-06-08T04:23:57Z)
- Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer (BPGT) to improve genetic mutation prediction performance.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Unsupervised language models for disease variant prediction [3.6942566104432886]
We find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot.
We show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
arXiv Detail & Related papers (2022-12-07T22:28:13Z)
- Scalable Pathogen Detection from Next Generation DNA Sequencing with Deep Learning [3.8175773487333857]
We propose MG2Vec, a deep learning-based solution that uses the transformer network as its backbone.
We show that the proposed approach can help detect pathogens from uncurated, real-world clinical samples.
We provide a comprehensive evaluation of a novel representation learning framework for metagenome-based disease diagnostics with deep learning.
arXiv Detail & Related papers (2022-11-30T00:13:59Z)
- PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism [10.468453827172477]
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate.
Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage.
arXiv Detail & Related papers (2021-11-03T01:30:57Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- Classification of Influenza Hemagglutinin Protein Sequences using Convolutional Neural Networks [8.397189036839956]
This paper focuses on accurately predicting if an Influenza type A virus can infect specific hosts, and more specifically, Human, Avian and Swine hosts, using only the protein sequence of the HA gene.
We propose encoding the protein sequences into numerical signals using the Hydrophobicity Index and subsequently utilising a Convolutional Neural Network-based predictive model.
As the results show, the proposed model can distinguish HA protein sequences with high accuracy whenever the virus under investigation can infect Human, Avian or Swine hosts.
arXiv Detail & Related papers (2021-08-09T10:42:26Z)
- Epigenetic evolution of deep convolutional models [81.21462458089142]
We build upon a previously proposed neuroevolution framework to evolve deep convolutional models.
We propose a convolutional layer layout which allows kernels of different shapes and sizes to coexist within the same layer.
The proposed layout enables the size and shape of individual kernels within a convolutional layer to be evolved with a corresponding new mutation operator.
arXiv Detail & Related papers (2021-04-12T12:45:16Z)
- Modelling SARS-CoV-2 coevolution with genetic algorithms [0.0]
The SARS-CoV-2 outbreak shook policy responses to the emergence of virus variants.
We propose coevolution with genetic algorithms (GAs) as a credible approach to model this relationship.
We present a dual GA model in which viruses aiming for survival and policy measures aiming at minimising infection rates evolve competitively.
arXiv Detail & Related papers (2021-02-24T15:49:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.