Related papers: BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

URL: http://arxiv.org/abs/2406.10391v1
Date: Fri, 14 Jun 2024 19:39:19 GMT
Title: BEACON: Benchmark for Comprehensive RNA Tasks and Language Models
Authors: Yuchen Ren, Zhiyuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, Dong Yuan, Wanli Ouyang, Xihui Liu
Abstract summary: We introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects.
Score: 60.02663015002029
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.
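
Illustrative sketch (not from the paper or its repository): the two components the abstract highlights, single nucleotide tokenization and ALiBi, can be shown in a few lines of plain Python. The vocabulary, the function names, and the symmetric |i - j| distance (a common adaptation for bidirectional encoders) are assumptions made for this example, and BEACON-B's actual implementation may differ.

```python
def tokenize_single_nucleotide(seq):
    """One token per base. The vocabulary here is hypothetical."""
    vocab = {"A": 0, "C": 1, "G": 2, "U": 3, "N": 4}
    # Treat DNA-style T as U; map anything unknown to N.
    normalized = seq.upper().replace("T", "U")
    return [vocab.get(base, vocab["N"]) for base in normalized]


def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence starting at 2^(-8/num_heads).

    Only the power-of-two head count case is handled in this sketch.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch assumes a power-of-two head count")
    ratio = 2 ** (-8 / num_heads)
    return [ratio ** (head + 1) for head in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention logits: -slope * |i - j| (symmetric variant)."""
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]


if __name__ == "__main__":
    tokens = tokenize_single_nucleotide("AUGGCU")
    slopes = alibi_slopes(8)
    bias = alibi_bias(len(tokens), slopes[0])
    print(tokens)
    print(slopes[:2])
    print(bias[0])
```

The bias depends only on the distance between positions, so no learned or sinusoidal position embedding is added to the token embeddings. With one token per nucleotide, that distance is measured directly in bases.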

Related papers

Regulatory DNA sequence Design with Reinforcement Learning [56.20290878358356]
We propose a generative approach that leverages reinforcement learning to fine-tune a pre-trained autoregressive model. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types.
arXiv Detail & Related papers (2025-03-11T02:33:33Z)
Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics [3.2508287756500165]
mRNA-based vaccines have become a major focus in the pharmaceutical industry. Optimizing mRNA sequences for those properties remains a complex challenge. We present Helix-mRNA, a structured state-space-based and attention hybrid model to address these challenges.
arXiv Detail & Related papers (2025-02-19T14:51:41Z)
Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models [0.0]
Understanding and predicting RNA behavior is a challenge due to the complexity of RNA structures and interactions. Current RNA models have yet to match the performance observed in the protein domain. ChaRNABERT is able to reach state-of-the-art performance on several tasks in established benchmarks.
arXiv Detail & Related papers (2024-11-05T21:56:16Z)
Comprehensive benchmarking of large language models for RNA secondary structure prediction [0.0]
RNA-LLM uses large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector. Among them, predicting the secondary structure is a fundamental task for uncovering RNA functional mechanisms. We present a comprehensive experimental analysis of several pre-trained RNA-LLM, comparing them for the RNA secondary structure prediction task in a unified deep learning framework.
arXiv Detail & Related papers (2024-10-21T17:12:06Z)
Beyond Sequence: Impact of Geometric Context for RNA Property Prediction [6.559586725997741]
RNA structures can be represented as 1D sequences, 2D topological graphs, or 3D all-atom models. Existing works predominantly focus on 1D sequence-based models, which overlook the geometric context provided by 2D and 3D geometries. This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction.
arXiv Detail & Related papers (2024-10-15T17:09:34Z)
RNACG: A Universal RNA Sequence Conditional Generation model based on Flow-Matching [0.0]
We develop a universal RNA sequence generation model based on flow matching, namely RNACG. RNACG can accommodate various conditional inputs and is portable, enabling users to customize the encoding network for conditional inputs. RNACG exhibits extensive applicability in sequence generation and property prediction tasks.
arXiv Detail & Related papers (2024-07-29T09:46:46Z)
RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks [1.1764999317813143]
We introduce RiboNucleic Acid Language Model (RiNALMo) to unveil the hidden code of RNA. RiNALMo is the largest RNA language model to date, with 650M parameters pre-trained on 36M non-coding RNA sequences.
arXiv Detail & Related papers (2024-02-29T14:50:58Z)
RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design [65.41144149958208]
This study aims to systematically construct a data-driven RNA design pipeline. We crafted a benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure. We incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process.
arXiv Detail & Related papers (2023-01-25T17:19:49Z)
Neural-Symbolic Recursive Machine for Systematic Generalization [113.22455566135757]
We introduce the Neural-Symbolic Recursive Machine (NSR), whose core is a Grounded Symbol System (GSS). NSR integrates neural perception, syntactic parsing, and semantic reasoning. We evaluate NSR's efficacy across four challenging benchmarks designed to probe systematic generalization capabilities.
arXiv Detail & Related papers (2022-10-04T13:27:38Z)
E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction [46.38735421190187]
We develop the first end-to-end deep learning approach, E2Efold-3D, to accurately perform de novo RNA structure prediction. Several novel components are proposed to overcome the data scarcity, such as a fully-differentiable end-to-end pipeline, secondary structure-assisted self-distillation, and parameter-efficient backbone formulation.
arXiv Detail & Related papers (2022-07-04T17:15:35Z)
Classification of Long Noncoding RNA Elements Using Deep Convolutional Neural Networks and Siamese Networks [17.8181080354116]
This thesis proposes a new method employing deep convolutional neural networks (CNNs) to classify ncRNA sequences. As a result, classifying RNA sequences is converted to an image classification problem that can be efficiently solved by CNN-based classification models.
arXiv Detail & Related papers (2021-02-10T17:26:38Z)
Syntax Role for Neural Semantic Role Labeling [77.5166510071142]
Semantic role labeling (SRL) is dedicated to recognizing the semantic predicate-argument structure of a sentence. Previous studies in terms of traditional models have shown syntactic information can make remarkable contributions to SRL performance. Recent neural SRL studies show that syntax information becomes much less important for neural semantic role labeling.
arXiv Detail & Related papers (2020-09-12T07:01:12Z)
RNA Secondary Structure Prediction By Learning Unrolled Algorithms [70.09461537906319]
In this paper, we propose an end-to-end deep learning model, called E2Efold, for RNA secondary structure prediction. The key idea of E2Efold is to directly predict the RNA base-pairing matrix, and use an unrolled algorithm for constrained programming as the template for deep architectures to enforce constraints. With comprehensive experiments on benchmark datasets, we demonstrate the superior performance of E2Efold.
arXiv Detail & Related papers (2020-02-13T23:21:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.