RiNALMo: General-Purpose RNA Language Models Can Generalize Well on
Structure Prediction Tasks
- URL: http://arxiv.org/abs/2403.00043v1
- Date: Thu, 29 Feb 2024 14:50:58 GMT
- Title: RiNALMo: General-Purpose RNA Language Models Can Generalize Well on
Structure Prediction Tasks
- Authors: Rafael Josip Penić, Tin Vlašić, Roland G. Huber, Yue Wan, Mile Šikić
- Abstract summary: We introduce RiboNucleic Acid Language Model (RiNALMo) to help unveil the hidden code of RNA.
RiNALMo is the largest RNA language model to date, with 650 million parameters pre-trained on 36 million non-coding RNA sequences.
- Score: 1.2466379414976048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ribonucleic acid (RNA) plays a variety of crucial roles in fundamental
biological processes. Recently, RNA has become an interesting drug target,
emphasizing the need to improve our understanding of its structures and
functions. Over the years, sequencing technologies have produced an enormous
amount of unlabeled RNA data, which hides important knowledge and potential.
Motivated by the successes of protein language models, we introduce RiboNucleic
Acid Language Model (RiNALMo) to help unveil the hidden code of RNA. RiNALMo is
the largest RNA language model to date, with 650 million parameters pre-trained
on 36 million non-coding RNA sequences from several available
databases. RiNALMo is able to extract hidden knowledge and capture the
underlying structure information implicitly embedded within the RNA sequences.
RiNALMo achieves state-of-the-art results on several downstream tasks. Notably,
we show that its generalization capabilities can overcome the inability of
other deep learning methods for secondary structure prediction to generalize on
unseen RNA families. The code has been made publicly available on
https://github.com/lbcb-sci/RiNALMo.
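Below is a minimal, illustrative PyTorch sketch of the general recipe the abstract describes: self-supervised pre-training of a Transformer encoder on raw RNA sequences so that per-base embeddings pick up structural signal. It assumes a masked-nucleotide prediction objective and a toy single-nucleotide vocabulary; the actual architecture, tokenization, scale (650 million parameters) and training setup of RiNALMo are described in the paper and the linked repository, and none of the names below come from that codebase.

```python
# Minimal sketch of masked-language-model pre-training on RNA sequences.
# Assumptions (not from the RiNALMo codebase): single-nucleotide vocabulary,
# tiny Transformer, 15% masking rate, no positional encoding.
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "<mask>": 1, "A": 2, "C": 3, "G": 4, "U": 5}

def tokenize(seq: str) -> torch.Tensor:
    return torch.tensor([VOCAB[nt] for nt in seq], dtype=torch.long)

class RnaMLM(nn.Module):
    def __init__(self, vocab_size=len(VOCAB), d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        # NOTE: a real model would also inject positional information
        # (e.g. learned or rotary embeddings); omitted here for brevity.
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: (B, L)
        hidden = self.encoder(self.embed(tokens))  # (B, L, d_model)
        return self.lm_head(hidden), hidden        # logits + per-base embeddings

def mlm_step(model, tokens, optimizer, mask_prob=0.15):
    """Mask a fraction of positions and train to recover the original bases."""
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_prob
    labels[~mask] = -100                           # ignore unmasked positions
    corrupted = tokens.clone()
    corrupted[mask] = VOCAB["<mask>"]
    logits, _ = model(corrupted)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels,
                                       ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = RnaMLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.stack([tokenize("GGGAAACUUCGGUUUCCC"), tokenize("AUGCUAGCUAGCAUGCGA")])
print(mlm_step(model, batch, opt))
```

The per-base hidden states returned alongside the logits are the kind of representations a downstream head (for example, a pairwise classifier for secondary structure prediction) would be fine-tuned on.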
Related papers
- Comprehensive benchmarking of large language models for RNA secondary structure prediction [0.0]
RNA-LLM uses large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector.
Predicting the secondary structure is a fundamental task for uncovering RNA functional mechanisms.
We present a comprehensive experimental analysis of several pre-trained RNA-LLM, comparing them for the RNA secondary structure prediction task in a unified deep learning framework.
arXiv Detail & Related papers (2024-10-21T17:12:06Z)
- RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design [35.66059762160962]
We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design.
We formulate RNA structures as a set of rigid-body frames and associated loss functions.
To tackle the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations.
arXiv Detail & Related papers (2024-06-19T21:06:44Z)
- BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Tasks and Language Models).
First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.
Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.
Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z)
- Description Generation using Variational Auto-Encoders for precursor microRNA [5.6710852973206105]
We propose a novel framework that makes use of generative modeling through Variational Auto-Encoders to uncover latent factors of pre-miRNA.
Applying the framework to classification, we obtain high reconstruction and classification performance, while also developing an accurate description.
arXiv Detail & Related papers (2023-11-29T15:41:45Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
arXiv Detail & Related papers (2023-10-04T10:30:08Z)
- Knowledge from Large-Scale Protein Contact Prediction Models Can Be Transferred to the Data-Scarce RNA Contact Prediction Task [40.051834115537474]
We find that a protein-coevolution Transformer-based deep neural network can be transferred to the RNA contact prediction task.
Experiments confirm that RNA contact prediction through transfer learning is greatly improved.
Our findings indicate that the learned structural patterns of proteins can be transferred to RNAs, opening up potential new avenues for research.
arXiv Detail & Related papers (2023-02-13T06:00:56Z)
- RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design [65.41144149958208]
This study aims to systematically construct a data-driven RNA design pipeline.
We crafted a benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure.
We incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process.
arXiv Detail & Related papers (2023-01-25T17:19:49Z)
- E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction [46.38735421190187]
We develop the first end-to-end deep learning approach, E2Efold-3D, to accurately perform the de novo RNA structure prediction.
Several novel components are proposed to overcome the data scarcity, such as a fully-differentiable end-to-end pipeline, secondary structure-assisted self-distillation, and parameter-efficient backbone formulation.
arXiv Detail & Related papers (2022-07-04T17:15:35Z)
- Predictive models of RNA degradation through dual crowdsourcing [2.003083111563343]
We describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle.
Winning models demonstrated test set errors that were better by 50% than the previous state-of-the-art DegScore model.
arXiv Detail & Related papers (2021-10-14T16:50:37Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.