mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design
- URL: http://arxiv.org/abs/2408.09048v2
- Date: Thu, 19 Dec 2024 22:38:35 GMT
- Title: mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design
- Authors: Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai
- Abstract summary: This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec.
In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec.
mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks.
- Abstract: Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, and 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements on 5' UTR translation efficiency (TE) and expression level (EL) prediction tasks compared with state-of-the-art methods such as UTR-LM. It also achieves competitive performance against CDS-focused methods such as CodonBERT on mRNA stability and protein production level tasks.
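The two mRNA-specific adaptations above (position-aware probabilistic masking over the joint 5' UTR-CDS input, and MFE/SS pretext heads on top of a data2vec-style teacher-student objective) can be illustrated with a minimal sketch. This is not the authors' implementation; the encoder, masking schedule, mask token ID, and loss weights below are assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of mRNA2vec-style pretraining pieces:
# position-weighted masking plus a data2vec-style teacher-student loss with
# auxiliary MFE-regression and secondary-structure heads. All hyperparameters
# and module shapes here are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in Transformer encoder over tokenized 5'UTR+CDS sequences."""
    def __init__(self, vocab=70, dim=128, layers=2, heads=4, max_len=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.encoder(self.emb(ids) + self.pos(pos))

def positional_mask(batch, seq_len, base_rate=0.15, decay=3.0, device="cpu"):
    """Mask 5'-proximal positions more often, reflecting the idea that the
    5' UTR / early CDS region matters more for translation."""
    rel = torch.arange(seq_len, device=device).float() / seq_len   # 0 .. 1 along the sequence
    rate = base_rate * (1.0 + torch.exp(-decay * rel))             # higher rate near the 5' end
    return torch.rand(batch, seq_len, device=device) < rate        # boolean mask

# --- one illustrative training step ---
student = TinyEncoder()
teacher = copy.deepcopy(student)          # EMA teacher; its update happens outside the loss
for p in teacher.parameters():
    p.requires_grad_(False)

mfe_head = nn.Linear(128, 1)              # auxiliary MFE regression head (sequence level)
ss_head = nn.Linear(128, 3)               # auxiliary per-token SS classes, e.g. '(', '.', ')'

ids = torch.randint(0, 70, (8, 256))      # toy batch of tokenized 5'UTR+CDS
mask = positional_mask(8, 256)
masked_ids = ids.masked_fill(mask, 1)     # assume token ID 1 is the [MASK] token

student_out = student(masked_ids)
with torch.no_grad():
    target = teacher(ids)                 # data2vec target: the teacher's representations

mfe_true = torch.randn(8)                 # placeholder MFE labels from an RNA folding tool
ss_true = torch.randint(0, 3, (8, 256))   # placeholder secondary-structure labels

loss_d2v = F.smooth_l1_loss(student_out[mask], target[mask])   # regress masked teacher states
loss_mfe = F.mse_loss(mfe_head(student_out.mean(dim=1)).squeeze(-1), mfe_true)
loss_ss = F.cross_entropy(ss_head(student_out).transpose(1, 2), ss_true)
loss = loss_d2v + 0.1 * loss_mfe + 0.1 * loss_ss               # assumed pretext-task weights
loss.backward()
```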
Related papers
- Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics [3.2508287756500165]
mRNA-based vaccines have become a major focus in the pharmaceutical industry.
However, optimizing mRNA sequences for therapeutic properties remains a complex challenge.
We present Helix-mRNA, a hybrid structured state-space and attention model, to address these challenges.
arXiv Detail & Related papers (2025-02-19T14:51:41Z)
- LoRA-BERT: a Natural Language Processing Model for Robust and Accurate Prediction of long non-coding RNAs [11.346750562942345]
Long non-coding RNAs (lncRNAs) serve as crucial regulators in numerous biological processes.
Deep learning-based approaches have been introduced to classify lncRNAs.
LoRA-BERT is designed to capture the importance of nucleotide-level information during sequence classification.
arXiv Detail & Related papers (2024-11-11T22:17:01Z)
- RNA-GPT: Multimodal Generative System for RNA Sequence Understanding [6.611255836269348]
RNAs are essential molecules that carry genetic information vital for life.
Despite this importance, RNA research is often hindered by the vast literature available on the topic.
We introduce RNA-GPT, a multi-modal RNA chat model designed to simplify RNA discovery.
arXiv Detail & Related papers (2024-10-29T06:19:56Z)
- HELM: Hierarchical Encoding for mRNA Language Modeling [4.990962434274757]
We introduce Hierarchical Encoding for mRNA Language Modeling (HELM).
HELM modulates the loss function based on codon synonymity, aligning the model's learning process with the biological reality of mRNA sequences.
We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre-training.
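As a rough illustration of loss modulation based on codon synonymity: a codon-level objective can penalize errors that land on a synonymous codon of the true one less than non-synonymous errors. The sketch below is a deliberately simplified example, not HELM's hierarchical formulation; the penalty matrix and its weight are illustrative assumptions.

```python
# Illustrative sketch (not HELM's actual code) of modulating a codon-level loss
# by codon synonymity: synonymous mistakes are penalized less than
# non-synonymous ones.
import itertools
import torch

# Standard codon -> amino acid table (single-letter codes, '*' = stop),
# with codons enumerated in TCAG order for all three positions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO))
CODON_IDX = {c: i for i, c in enumerate(CODONS)}

def penalty_matrix(syn_weight=0.3):
    """64x64 penalty matrix: 0 on the diagonal, `syn_weight` for synonymous
    codon pairs, 1.0 for non-synonymous pairs."""
    m = torch.ones(64, 64)
    for a in CODONS:
        for b in CODONS:
            if a == b:
                m[CODON_IDX[a], CODON_IDX[b]] = 0.0
            elif CODON_TO_AA[a] == CODON_TO_AA[b]:
                m[CODON_IDX[a], CODON_IDX[b]] = syn_weight
    return m

PENALTY = penalty_matrix()

def synonymity_modulated_loss(logits, target):
    """logits: (batch, seq, 64) codon logits; target: (batch, seq) codon IDs.
    Expected penalty under the predicted codon distribution."""
    probs = logits.softmax(dim=-1)          # (B, S, 64)
    per_class_penalty = PENALTY[target]     # (B, S, 64), one penalty row per true codon
    return (probs * per_class_penalty).sum(dim=-1).mean()

# Toy usage
logits = torch.randn(2, 10, 64, requires_grad=True)
target = torch.randint(0, 64, (2, 10))
loss = synonymity_modulated_loss(logits, target)
loss.backward()
```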
arXiv Detail & Related papers (2024-10-16T11:16:47Z)
- BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark, BEACON (BEnchmArk for COmprehensive RNA Tasks and Language Models).
First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.
Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.
Third, we investigate the vital components of RNA language models.
arXiv Detail & Related papers (2024-06-14T19:39:19Z)
- A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions [39.54284059106283]
The 5' UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process.
Here, we introduce a language model for 5' UTR, which we refer to as the UTR-LM.
The model outperformed the best-known benchmark by up to 42% for predicting the Mean Ribosome Loading, and by up to 60% for predicting the Translation Efficiency and the mRNA Expression Level.
arXiv Detail & Related papers (2023-10-05T03:15:01Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, positional encoding via gene embeddings, and a bidirectional Hyena operator.
This enables the model to process full-length scRNA-seq data without losing information from the raw data.
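A conceptual sketch of that input pathway (a linear adaptor for expression values plus gene-identity embeddings feeding a long-sequence mixer) is shown below. It is not the scHyena code: a standard Transformer encoder stands in for the bidirectional Hyena operator, and all sizes are assumed.

```python
# Conceptual sketch (not the scHyena implementation): a linear adaptor maps
# scalar expression values into the model dimension, gene-identity embeddings
# act as positional encoding, and a sequence operator mixes the full-length
# profile. A Transformer encoder stands in for the bidirectional Hyena operator.
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    def __init__(self, n_genes=20000, dim=128, layers=2, heads=4):
        super().__init__()
        self.adaptor = nn.Linear(1, dim)              # linear adaptor for expression values
        self.gene_emb = nn.Embedding(n_genes, dim)    # gene embedding as positional encoding
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.mixer = nn.TransformerEncoder(block, layers)   # stand-in for bidirectional Hyena

    def forward(self, expression, gene_ids):
        # expression: (batch, n_genes) continuous values; gene_ids: (batch, n_genes) int IDs
        x = self.adaptor(expression.unsqueeze(-1)) + self.gene_emb(gene_ids)
        return self.mixer(x)                          # (batch, n_genes, dim) token states

# Toy usage on a tiny "full-length" expression profile
model = ExpressionEncoder(n_genes=50)
expr = torch.rand(2, 50)
ids = torch.arange(50).expand(2, 50)
states = model(expr, ids)
```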
arXiv Detail & Related papers (2023-10-04T10:30:08Z)
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We perform structural surgery on pLMs: a lightweight structural adapter is implanted that endows them with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
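A generic sketch of the adapter idea (a small structural module fused with a frozen pLM's token states) follows. It is not the LM-Design implementation; the structure-feature dimensions, bottleneck size, vocabulary size, and stand-in backbone are assumptions.

```python
# Generic sketch (not the LM-Design code) of a lightweight structural adapter
# bolted onto a frozen protein language model: the pLM's per-residue states are
# fused with structure features through a small residual bottleneck, while the
# pLM weights stay frozen. All dimensions are illustrative.
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    def __init__(self, lm_dim=768, struct_dim=16, bottleneck=64, vocab=33):
        super().__init__()
        self.struct_proj = nn.Linear(struct_dim, lm_dim)   # lift structure features
        self.down = nn.Linear(lm_dim, bottleneck)          # lightweight bottleneck
        self.up = nn.Linear(bottleneck, lm_dim)
        self.head = nn.Linear(lm_dim, vocab)               # amino-acid design logits

    def forward(self, lm_states, struct_feats):
        # lm_states: (B, L, lm_dim) frozen pLM outputs; struct_feats: (B, L, struct_dim)
        fused = lm_states + self.struct_proj(struct_feats)
        fused = fused + self.up(torch.relu(self.down(fused)))   # residual adapter update
        return self.head(fused)

# Toy usage: pretend `plm` is a frozen sequence model producing per-residue states.
plm = nn.GRU(input_size=21, hidden_size=768, batch_first=True)  # stand-in for a real pLM
for p in plm.parameters():
    p.requires_grad_(False)

adapter = StructuralAdapter()
seq_feats = torch.randn(2, 100, 21)       # placeholder residue features
struct = torch.randn(2, 100, 16)          # placeholder backbone geometry features
states, _ = plm(seq_feats)
logits = adapter(states, struct)          # (2, 100, 33) per-residue design logits
```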
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
- Accurate RNA 3D structure prediction using a language model-based deep learning approach [50.193512039121984]
RhoFold+ is an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences.
RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction.
arXiv Detail & Related papers (2022-07-04T17:15:35Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
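A minimal sketch of the mining step described in that entry, assuming mean-pooled `bert-base-multilingual-cased` embeddings from the Hugging Face `transformers` library; the paper's margin-based scoring and self-training refinement are not reproduced, and the similarity threshold is illustrative.

```python
# Minimal sketch of mining pseudo-parallel sentence pairs with multilingual BERT
# embeddings and nearest-neighbor search. Assumes the `transformers` package;
# the model choice and threshold below are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed(sentences):
    """Mean-pooled mBERT sentence embeddings, L2-normalized for cosine search."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (N, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding tokens
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(emb, dim=-1)

src = ["The cat sits on the mat.", "I like green tea."]
tgt = ["Ich trinke gern grünen Tee.", "Die Katze sitzt auf der Matte."]

sims = embed(src) @ embed(tgt).T                          # cosine similarities
best = sims.argmax(dim=1)                                 # nearest target per source
for i, j in enumerate(best.tolist()):
    if sims[i, j] > 0.5:                                  # illustrative acceptance threshold
        print(src[i], "<->", tgt[j])
```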