ProtiGeno: a prokaryotic short gene finder using protein language models
- URL: http://arxiv.org/abs/2307.10343v1
- Date: Wed, 19 Jul 2023 16:46:42 GMT
- Title: ProtiGeno: a prokaryotic short gene finder using protein language models
- Authors: Tony Tu, Gautham Krishna, Amirali Aghazadeh
- Abstract summary: Current gene finders are highly sensitive in finding long genes, but their sensitivity decreases noticeably in finding shorter genes.
We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes.
In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders.
- Score: 1.2354076490479513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prokaryotic gene prediction plays an important role in understanding the
biology of organisms and their function with applications in medicine and
biotechnology. Although the current gene finders are highly sensitive in
finding long genes, their sensitivity decreases noticeably in finding shorter
genes (<180 nts). The culprit is insufficient annotated gene data to identify
distinguishing features in short open reading frames (ORFs). We develop a deep
learning-based method called ProtiGeno, specifically targeting short
prokaryotic genes using a protein language model trained on millions of evolved
proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes,
we demonstrate that ProtiGeno predicts short coding and noncoding genes with
higher accuracy and recall than the current state-of-the-art gene finders. We
discuss the predictive features of ProtiGeno and possible limitations by
visualizing the three-dimensional structure of the predicted short genes. Data,
codes, and models are available at https://github.com/tonytu16/protigeno.
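The abstract defines short genes as ORFs under 180 nts. As a rough illustration of what enumerating such candidates could look like, here is a minimal Python sketch. It is not the ProtiGeno implementation: the function names, the start and stop codon sets (the usual bacterial starts ATG/GTG/TTG and the standard stops), the `min_len` floor, and the choice to count the stop codon in the ORF length are all assumptions made for this example.

```python
# Hypothetical sketch: enumerate short ORF candidates on both strands.
# Codon sets and thresholds are illustrative assumptions, not taken from the paper.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_short_orfs(seq, max_len=180, min_len=30):
    """Return (strand, start, end, orf) tuples with min_len <= len(orf) < max_len.

    start and end are 0-based, half-open offsets on the scanned strand.
    For each stop codon, only the most upstream in-frame start is reported.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # include the stop codon in the span
                    if min_len <= end - start < max_len:
                        hits.append((strand, start, end, s[start:end]))
                    start = None  # close the ORF and keep scanning
    return hits


if __name__ == "__main__":
    demo = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
    for hit in find_short_orfs(demo):
        print(hit)
```

In a pipeline like the one the abstract describes, candidates of this kind would presumably be translated to amino acids and scored by a protein language model; that step is not sketched here.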
Related papers
- Learning to Discover Regulatory Elements for Gene Expression Prediction [59.470991831978516]
Seq2Exp is a Sequence to Expression network designed to discover and extract regulatory elements that drive target gene expression.
Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements.
arXiv Detail & Related papers (2025-02-19T03:25:49Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Survey and Improvement Strategies for Gene Prioritization with Large Language Models [61.24568051916653]
Large language models (LLMs) have performed well in medical exams, but their effectiveness in diagnosing rare genetic diseases has not been assessed.
We used multi-agent and Human Phenotype Ontology (HPO) classification to categorize patients based on phenotypes and solvability levels.
At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly.
arXiv Detail & Related papers (2025-01-30T23:03:03Z)
- GeneQuery: A General QA-based Framework for Spatial Gene Expression Predictions from Histology Images [41.732831871866516]
Whole-slide hematoxylin and eosin stained histological images are readily accessible and allow for detailed examinations of tissue structure and composition at the microscopic level.
Recent advancements have utilized these histological images to predict spatially resolved gene expression profiles.
GeneQuery aims to solve this gene expression prediction task in a question-answering (QA) manner for better generality and flexibility.
arXiv Detail & Related papers (2024-11-27T14:33:13Z)
- An Evolutional Neural Network Framework for Classification of Microarray Data [0.0]
This research aims to apply a hybrid model of Genetic Algorithm and Neural Network to overcome the problem during subset selection of informative genes.
Experimental results show that the proposed method achieves high accuracy with a minimal number of selected genes compared with other machine learning algorithms.
arXiv Detail & Related papers (2024-11-20T13:48:40Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity [3.972930262155919]
We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences.
We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats.
arXiv Detail & Related papers (2024-05-09T09:34:51Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases.
Gene expression can play a fundamental role in the early detection of cancer.
This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
- SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model [3.0643865202019698]
We propose a new solution named SemanticCAP to identify accessible regions of the genome.
It introduces a gene language model which models the context of gene sequences, thus being able to provide an effective representation of gene sequences.
Compared with other systems under public benchmarks, our model proved to have better performance.
arXiv Detail & Related papers (2022-04-05T11:47:58Z)
- SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression [8.326669256957352]
We present SimpleChrome, a deep learning model that learns the histone modification representations of genes.
The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
arXiv Detail & Related papers (2020-12-15T23:30:36Z)