PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
- URL: http://arxiv.org/abs/2406.13133v1
- Date: Wed, 19 Jun 2024 00:53:48 GMT
- Title: PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
- Authors: Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang,
- Abstract summary: PathoLM is a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences.
We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens.
In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities.
- Score: 9.285895422810704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
Related papers
- Deep-Ace: LSTM-based Prokaryotic Lysine Acetylation Site Predictor [10.293190253043049]
Acetylation of lysine residues (K-Ace) is a post-translation modification occurring in both prokaryotes and eukaryotes.
In the current work we propose Deep-Ace, a deep learning-based framework using Long-Short-Term-Memory (LSTM) network.
Our proposed method has outperformed existing state of the art models achieving accuracy as 0.80, 0.79, 0.71, 0.75, 0.80, 0.83, 0.756, and 0.82 respectively for eight bacterial species.
arXiv Detail & Related papers (2024-10-13T19:10:57Z) - BeeTLe: A Framework for Linear B-Cell Epitope Prediction and
Classification [0.43512163406551996]
This paper presents a new deep learning-based framework for linear B-cell prediction as well as antibody type-specific classification.
We propose an amino acid encoding method based on eigen decomposition to help the model learn the representations of antibodies.
Experimental results on data curated from the largest public database demonstrate the validity of the proposed methods.
arXiv Detail & Related papers (2023-09-05T09:18:29Z) - Scalable Pathogen Detection from Next Generation DNA Sequencing with
Deep Learning [3.8175773487333857]
We propose MG2Vec, a deep learning-based solution that uses the transformer network as its backbone.
We show that the proposed approach can help detect pathogens from uncurated, real-world clinical samples.
We provide a comprehensive evaluation of a novel representation learning framework for metagenome-based disease diagnostics with deep learning.
arXiv Detail & Related papers (2022-11-30T00:13:59Z) - Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results, however the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes pretrained models on a source language to adapt to the tasks that are in a different language and have scarce data.
arXiv Detail & Related papers (2022-10-05T20:44:55Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Deep neural networks approach to microbial colony detection -- a
comparative analysis [52.77024349608834]
This study investigates the performance of three deep learning approaches for object detection on the AGAR dataset.
The achieved results may serve as a benchmark for future experiments.
arXiv Detail & Related papers (2021-08-23T12:06:00Z) - A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC)
arXiv Detail & Related papers (2021-08-07T15:08:15Z) - A Cross-Level Information Transmission Network for Predicting Phenotype
from New Genotype: Application to Cancer Precision Medicine [37.442717660492384]
We propose a novel Cross-LEvel Information Transmission network (CLEIT) framework.
Inspired by domain adaptation, CLEIT first learns the latent representation of high-level domain then uses it as ground-truth embedding.
We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations.
arXiv Detail & Related papers (2020-10-09T22:01:00Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Genome Sequence Classification for Animal Diagnostics with Graph
Representations and Deep Neural Networks [4.339839287869652]
Bovine Respiratory Disease Complex (BRDC) is a complex respiratory disease in cattle with multiple etiologies, including bacterial and viral.
Current animal disease diagnostics is based on traditional tests such as bacterial culture, serolog, and Polymerase Chain Reaction (PCR) tests.
We show that networks-based machine learning approaches can detect pathogen signature with up to 89.7% accuracy.
arXiv Detail & Related papers (2020-07-24T22:30:18Z) - Accelerating Antimicrobial Discovery with Controllable Deep Generative
Models and Molecular Dynamics [109.70543391923344]
CLaSS (Controlled Latent attribute Space Sampling) is an efficient computational method for attribute-controlled generation of molecules.
We screen the generated molecules for additional key attributes by using deep learning classifiers in conjunction with novel features derived from atomistic simulations.
The proposed approach is demonstrated for designing non-toxic antimicrobial peptides (AMPs) with strong broad-spectrum potency.
arXiv Detail & Related papers (2020-05-22T15:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.