Related papers: ProtiGeno: a prokaryotic short gene finder using protein language models

ProtiGeno: a prokaryotic short gene finder using protein language models

URL: http://arxiv.org/abs/2307.10343v1
Date: Wed, 19 Jul 2023 16:46:42 GMT
Title: ProtiGeno: a prokaryotic short gene finder using protein language models
Authors: Tony Tu, Gautham Krishna, Amirali Aghazadeh
Abstract summary: Current gene finders are highly sensitive in finding long genes, but their sensitivity decreases noticeably in finding shorter genes. We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders.
Score: 1.2354076490479513
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prokaryotic gene prediction plays an important role in understanding the biology of organisms and their function with applications in medicine and biotechnology. Although the current gene finders are highly sensitive in finding long genes, their sensitivity decreases noticeably in finding shorter genes (<180 nts). The culprit is insufficient annotated gene data to identify distinguishing features in short open reading frames (ORFs). We develop a deep learning-based method called ProtiGeno, specifically targeting short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and possible limitations by visualizing the three-dimensional structure of the predicted short genes. Data, codes, and models are available at https://github.com/tonytu16/protigeno.

Related papers

GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations.<n>In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data.<n>We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
Learning to Discover Regulatory Elements for Gene Expression Prediction [59.470991831978516]
Seq2Exp is a Sequence to Expression network designed to discover and extract regulatory elements that drive target gene expression. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements.
arXiv Detail & Related papers (2025-02-19T03:25:49Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion [108.94237816552024]
We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA. The model contains three novel components: 1) a MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency.
arXiv Detail & Related papers (2025-02-07T06:16:31Z)
GeneQuery: A General QA-based Framework for Spatial Gene Expression Predictions from Histology Images [41.732831871866516]
Whole-slide hematoxylin and eosin stained histological images are readily accessible and allow for detailed examinations of tissue structure and composition at the microscopic level. Recent advancements have utilized these histological images to predict spatially resolved gene expression profiles. GeneQuery aims to solve this gene expression prediction task in a question-answering (QA) manner for better generality and flexibility.
arXiv Detail & Related papers (2024-11-27T14:33:13Z)
An Evolutional Neural Network Framework for Classification of Microarray Data [0.0]
This research aims to apply a hybrid model of Genetic Algorithm and Neural Network to overcome the problem during subset selection of informative genes. Experimental results show the proposed method suggested high accuracy and minimum number of selected genes in comparison with other machine learning algorithms.
arXiv Detail & Related papers (2024-11-20T13:48:40Z)
Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model. It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity [3.972930262155919]
We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats.
arXiv Detail & Related papers (2024-05-09T09:34:51Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes. Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues. textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model [3.0643865202019698]
We propose a new solution named SemanticCAP to identify accessible regions of the genome. It introduces a gene language model which models the context of gene sequences, thus being able to provide an effective representation of gene sequences. Compared with other systems under public benchmarks, our model proved to have better performance.
arXiv Detail & Related papers (2022-04-05T11:47:58Z)
Feature extraction using Spectral Clustering for Gene Function Prediction [0.4492444446637856]
This paper presents a novel in silico approach for to the annotation problem that combines cluster analysis and hierarchical multi-label classification. The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world.
arXiv Detail & Related papers (2022-03-25T10:17:36Z)
SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression [8.326669256957352]
We present SimpleChrome, a deep learning model that learns the histone modification representations of genes. The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
arXiv Detail & Related papers (2020-12-15T23:30:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.