Classification of Influenza Hemagglutinin Protein Sequences using
Convolutional Neural Networks
- URL: http://arxiv.org/abs/2108.04240v1
- Date: Mon, 9 Aug 2021 10:42:26 GMT
- Title: Classification of Influenza Hemagglutinin Protein Sequences using
Convolutional Neural Networks
- Authors: Charalambos Chrysostomou, Floris Alexandrou, Mihalis A. Nicolaou and
Huseyin Seker
- Abstract summary: This paper focuses on accurately predicting if an Influenza type A virus can infect specific hosts, and more specifically, Human, Avian and Swine hosts, using only the protein sequence of the HA gene.
We propose encoding the protein sequences into numerical signals using the Hydrophobicity Index and subsequently utilising a Convolutional Neural Network-based predictive model.
As the results show, the proposed model can distinguish HA protein sequences with high accuracy whenever the virus under investigation can infect Human, Avian or Swine hosts.
- Score: 8.397189036839956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Influenza virus can be considered as one of the most severe viruses that
can infect multiple species with often fatal consequences to the hosts. The
Hemagglutinin (HA) gene of the virus can be a target for antiviral drug
development realised through accurate identification of its sub-types and
possibly the targeted hosts. This paper focuses on accurately predicting if an
Influenza type A virus can infect specific hosts, and more specifically, Human,
Avian and Swine hosts, using only the protein sequence of the HA gene. In more
detail, we propose encoding the protein sequences into numerical signals using
the Hydrophobicity Index and subsequently utilising a Convolutional Neural
Network-based predictive model. The Influenza HA protein sequences used in the
proposed work are obtained from the Influenza Research Database (IRD).
Specifically, complete and unique HA protein sequences were used for avian,
human and swine hosts. The data obtained for this work was 17999 human-host
proteins, 17667 avian-host proteins and 9278 swine-host proteins. Given this
set of collected proteins, the proposed method yields as much as 10% higher
accuracy for an individual class (namely, Avian) and 5% higher overall accuracy
than in an earlier study. It is also observed that the accuracy for each class
in this work is more balanced than what was presented in this earlier study. As
the results show, the proposed model can distinguish HA protein sequences with
high accuracy whenever the virus under investigation can infect Human, Avian or
Swine hosts.
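The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which hydrophobicity index was used, so the Kyte-Doolittle scale is assumed here, and the helper names (`encode_sequence`, `conv1d_relu`), the fixed length, the zero padding and the example kernel are all hypothetical choices made for illustration. The sketch maps each amino acid to a number, pads the result to a fixed length, and applies one 1D convolution filter followed by ReLU, which is the basic building block of a CNN over such a signal.

```python
import numpy as np

# Kyte-Doolittle hydropathy values (assumed scale; the abstract does not
# specify which hydrophobicity index the authors used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(seq, length, unknown=0.0, pad=0.0):
    """Map a protein sequence to a fixed-length hydrophobicity signal.

    Residues outside the 20 standard amino acids get `unknown`; sequences
    are truncated or right-padded with `pad` to exactly `length` values.
    """
    values = [KYTE_DOOLITTLE.get(aa, unknown) for aa in seq.upper()[:length]]
    values += [pad] * (length - len(values))
    return np.asarray(values, dtype=np.float32)


def conv1d_relu(signal, kernel):
    """Apply one 'valid' 1D filter (cross-correlation) followed by ReLU."""
    return np.maximum(np.correlate(signal, kernel, mode="valid"), 0.0)


# Hypothetical short peptide, used only to exercise the functions.
signal = encode_sequence("MKTIIALSYIF", length=16)
features = conv1d_relu(signal, np.array([1.0, 0.0, -1.0], dtype=np.float32))
print(signal.shape, features.shape)
```

In a full model, many such filters would be learned rather than fixed, and their outputs would feed pooling and dense layers that produce the Human, Avian or Swine prediction.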
Related papers
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Clustering for Protein Representation Learning [72.72957540484664]
We propose a neural clustering framework that can automatically discover the critical components of a protein.
Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids.
We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene term prediction, and enzyme commission number prediction.
arXiv Detail & Related papers (2024-03-30T05:51:09Z)
- NaNa and MiGu: Semantic Data Augmentation Techniques to Enhance Protein Classification in Graph Neural Networks [60.48306899271866]
We propose novel semantic data augmentation methods to incorporate backbone chemical and side-chain biophysical information into protein classification tasks.
Specifically, we leverage molecular biophysical, secondary structure, chemical bonds, and ionic features of proteins to facilitate classification tasks.
arXiv Detail & Related papers (2024-03-21T13:27:57Z)
- MC-NN: An End-to-End Multi-Channel Neural Network Approach for Predicting Influenza A Virus Hosts and Antigenic Types [5.067354030054702]
Influenza poses a significant threat to public health, particularly among the elderly, young children, and people with underlying diseases.
We propose a multi-channel neural network model to predict the host and antigenic sub-types of influenza A viruses.
arXiv Detail & Related papers (2023-06-08T23:14:39Z)
- A Latent Diffusion Model for Protein Structure Generation [50.74232632854264]
We propose a latent diffusion model that can reduce the complexity of protein modeling.
We show that our method can effectively generate novel protein backbone structures with high designability and efficiency.
arXiv Detail & Related papers (2023-05-06T19:10:19Z)
- Virus2Vec: Viral Sequence Classification Using Machine Learning [48.40285316053593]
We propose Virus2Vec, a feature-vector representation for viral sequences that enables machine learning models to identify viral hosts.
We empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host.
Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
arXiv Detail & Related papers (2023-04-24T08:17:16Z)
- Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences [4.289396744209968]
Influenza viruses mutate rapidly and can pose a threat to public health, especially to those in vulnerable groups.
Recently, there has been increasing interest in using machine learning algorithms to provide fast and accurate predictions for viral sequences.
In this study, real testing data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels.
arXiv Detail & Related papers (2022-07-28T00:54:54Z)
- Multi-channel neural networks for predicting influenza A virus hosts and antigenic types [3.1981440103815717]
A fast, accurate and low-cost method to predict the origin host and subtype of influenza viruses could help reduce virus transmission and benefit resource-poor areas.
We propose multi-channel neural networks to predict antigenic types and hosts of influenza A viruses with complete and partial protein sequences.
arXiv Detail & Related papers (2022-06-08T11:47:31Z)
- Accurate Virus Identification with Interpretable Raman Signatures by Machine Learning [12.184128048998906]
We present a machine learning approach for analyzing Raman spectra of human and avian viruses.
A Convolutional Neural Network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks.
arXiv Detail & Related papers (2022-06-05T22:31:14Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving Protein Populations [0.0]
Influenza virus sequences were identified as an ideal test case for this deep learning framework.
MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids.
Results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting with implications for broad utility in evolutionary prediction for any protein population.
arXiv Detail & Related papers (2020-08-26T20:20:30Z)