RNAMunin: A Deep Machine Learning Model for Non-coding RNA Discovery
- URL: http://arxiv.org/abs/2507.11950v1
- Date: Wed, 16 Jul 2025 06:33:50 GMT
- Title: RNAMunin: A Deep Machine Learning Model for Non-coding RNA Discovery
- Authors: Lauren Lui, Torben Nielsen,
- Abstract summary: Non-coding RNAs (ncRNAs) are critical for regulating bacterial and archaeal physiology, stress response and metabolism.<n>This paper presents RNAMunin, a machine learning (ML) model that is capable of finding ncRNAs using genomic sequence alone.<n> RNAMunin is trained on Rfam sequences extracted from approximately 60 Gbp of long read metagenomes from 16 San Francisco Estuary samples.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Functional annotation of microbial genomes is often biased toward protein-coding genes, leaving a vast, unexplored landscape of non-coding RNAs (ncRNAs) that are critical for regulating bacterial and archaeal physiology, stress response and metabolism. Identifying ncRNAs directly from genomic sequence is a paramount challenge in bioinformatics and biology, essential for understanding the complete regulatory potential of an organism. This paper presents RNAMunin, a machine learning (ML) model that is capable of finding ncRNAs using genomic sequence alone. It is also computationally viable for large sequence datasets such as long read metagenomic assemblies with contigs totaling multiple Gbp. RNAMunin is trained on Rfam sequences extracted from approximately 60 Gbp of long read metagenomes from 16 San Francisco Estuary samples. We know of no other model that can detect ncRNAs based solely on genomic sequence at this scale. Since RNAMunin only requires genomic sequence as input, we do not need for an ncRNA to be transcribed to find it, i.e., we do not need transcriptomics data. We wrote this manuscript in a narrative style in order to best convey how RNAMunin was developed and how it works in detail. Unlike almost all current ML models, at approximately 1M parameters, RNAMunin is very small and very fast.
Related papers
- CircFormerMoE: An End-to-End Deep Learning Framework for Circular RNA Splice Site Detection and Pairing in Plant Genomes [0.0]
Circular RNAs (circRNAs) are important components of the non-coding RNA regulatory network.<n>We propose a deep learning framework named CircFormerMoE based on transformers and mixture-of experts for predicting circRNAs directly from plant genomic DNA.
arXiv Detail & Related papers (2025-07-11T12:43:17Z) - scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders [43.24785083027205]
scMamba is a pre-trained model designed to improve the quality and utility of snRNA-seq analysis.<n>Inspired by the recent Mamba model, scMamba introduces a novel architecture that incorporates a linear adapter layer, gene embeddings, and bidirectional Mamba blocks.<n>We demonstrate that scMamba outperforms benchmark methods in various downstream tasks, including cell type annotation, doublet detection, imputation, and the identification of differentially expressed genes.
arXiv Detail & Related papers (2025-02-12T11:48:22Z) - Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification [55.98854157265578]
Life-Code is a comprehensive framework that spans different biological functions.<n>We propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences.<n>Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
arXiv Detail & Related papers (2025-02-11T06:53:59Z) - RNA-GPT: Multimodal Generative System for RNA Sequence Understanding [6.611255836269348]
RNAs are essential molecules that carry genetic information vital for life.
Despite this importance, RNA research is often hindered by the vast literature available on the topic.
We introduce RNA-GPT, a multi-modal RNA chat model designed to simplify RNA discovery.
arXiv Detail & Related papers (2024-10-29T06:19:56Z) - BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (textbfBEnchmtextbfArk for textbfCOmprehensive RtextbfNA Task and Language Models).<n>First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.<n>Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.<n>Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z) - RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks [1.1764999317813143]
We introduce RiboNucleic Acid Language Model (RiNALMo) to unveil the hidden code of RNA.
RiNALMo is the largest RNA language model to date, with 650M parameters pre-trained on 36M non-coding RNA sequences.
arXiv Detail & Related papers (2024-02-29T14:50:58Z) - Description Generation using Variational Auto-Encoders for precursor
microRNA [5.6710852973206105]
We propose a novel framework, which makes use of generative modeling through Vari Auto-Encoders to uncover latent factors of pre-miRNA.
Applying the framework to classification, we obtain a high reconstruction and classification performance, while also developing an accurate description.
arXiv Detail & Related papers (2023-11-29T15:41:45Z) - scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis
in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
arXiv Detail & Related papers (2023-10-04T10:30:08Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide
Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - Accurate RNA 3D structure prediction using a language model-based deep learning approach [50.193512039121984]
RhoFold+ is an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences.<n>RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction.
arXiv Detail & Related papers (2022-07-04T17:15:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.