A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and
Function Predictions
- URL: http://arxiv.org/abs/2310.03281v2
- Date: Fri, 6 Oct 2023 05:56:16 GMT
- Title: A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and
Function Predictions
- Authors: Yanyi Chu, Dan Yu, Yupeng Li, Kaixuan Huang, Yue Shen, Le Cong, Jason
Zhang, Mengdi Wang
- Abstract summary: The 5' UTR, a regulatory region at the beginning of an mRNA molecule, plays a crucial role in regulating the translation process.
Here, we introduce a language model for the 5' UTR, which we refer to as the UTR-LM.
The model outperformed the best-known benchmark by up to 42% for predicting the Mean Ribosome Loading, and by up to 60% for predicting the Translation Efficiency and the mRNA Expression Level.
- Score: 39.54284059106283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The 5' UTR, a regulatory region at the beginning of an mRNA
molecule, plays a crucial role in regulating the translation process and
impacts the protein expression level. Language models have showcased their
effectiveness in decoding the functions of protein and genome sequences. Here,
we introduce a language model for the 5' UTR, which we refer to as the UTR-LM.
The UTR-LM is pre-trained on endogenous 5' UTRs from multiple species and is
further augmented with supervised information, including secondary structure
and minimum free energy. We fine-tuned the UTR-LM on a variety of downstream
tasks. The model outperformed the best-known benchmark by up to 42% for
predicting the Mean Ribosome Loading, and by up to 60% for predicting the
Translation Efficiency and the mRNA Expression Level. The model also
identifies unannotated Internal Ribosome Entry Sites within the untranslated
region, improving the AUPR from 0.37 to 0.52 compared with the best baseline.
Further, we designed a library of 211 novel 5' UTRs with high predicted values
of translation efficiency and evaluated them via a wet-lab assay. Experimental
results confirmed that our top designs achieved a 32.5% increase in protein
production level relative to a well-established 5' UTR optimized for
therapeutics.
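To make the downstream setup concrete, here is a minimal, hypothetical sketch
of fine-tuning a small transformer encoder with a regression head to predict
Mean Ribosome Loading. The architecture, vocabulary, and target value are
illustrative stand-ins, not the released UTR-LM:

```python
# Hypothetical sketch: transformer encoder + regression head for MRL,
# NOT the actual UTR-LM architecture or weights.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "<pad>": 4}

def encode(seq: str, max_len: int = 128) -> torch.Tensor:
    ids = [VOCAB[c] for c in seq[:max_len]]
    ids += [VOCAB["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids)

class UTRRegressor(nn.Module):
    """Transformer encoder + masked mean pooling + linear regression head."""
    def __init__(self, d_model=128, nhead=4, nlayers=4):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model, padding_idx=VOCAB["<pad>"])
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, 1)            # scalar MRL prediction

    def forward(self, ids):
        pad = ids.eq(VOCAB["<pad>"])
        h = self.encoder(self.embed(ids), src_key_padding_mask=pad)
        h = h.masked_fill(pad.unsqueeze(-1), 0.0)
        pooled = h.sum(1) / (~pad).sum(1, keepdim=True)  # mean over real tokens
        return self.head(pooled).squeeze(-1)

model = UTRRegressor()
ids = torch.stack([encode("GGGACAUUUGCUUCUGACACAACUGU")])  # toy 5' UTR
target = torch.tensor([4.2])                  # made-up measured MRL value
loss = nn.functional.mse_loss(model(ids), target)
loss.backward()                               # fine-tune with any optimizer
```

In the actual pipeline, auxiliary labels such as secondary structure and
minimum free energy augment pre-training; this sketch covers only the
downstream regression.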
Related papers
- Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics [3.2508287756500165]
mRNA-based vaccines have become a major focus in the pharmaceutical industry.
However, optimizing mRNA sequences for the properties these therapeutics require remains a complex challenge.
We present Helix-mRNA, a structured state-space-based and attention hybrid model to address these challenges.
arXiv Detail & Related papers (2025-02-19T14:51:41Z)
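As a rough illustration of the hybrid idea above (not Helix-mRNA's actual
architecture), a block can interleave an SSM-like linear recurrence with
self-attention:

```python
# Toy sketch of a state-space + attention hybrid block; only the pattern of
# interleaving the two mixers is taken from the paper's description.
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Minimal per-channel linear recurrence: h_t = a * h_{t-1} + b * x_t."""
    def __init__(self, d_model: int):
        super().__init__()
        self.a_logit = nn.Parameter(torch.zeros(d_model))  # decay, kept in (0, 1)
        self.b = nn.Parameter(torch.ones(d_model))

    def forward(self, x):                        # x: (batch, time, d_model)
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):               # sequential scan, O(T)
            h = a * h + self.b * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

class HybridBlock(nn.Module):
    """SSM mixer followed by self-attention, each with a residual connection."""
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.ssm = DiagonalSSM(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))
        y = self.norm2(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x

block = HybridBlock()
out = block(torch.randn(2, 16, 64))              # (2, 16, 64)
```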
- LoRA-BERT: a Natural Language Processing Model for Robust and Accurate Prediction of long non-coding RNAs [11.346750562942345]
Long non-coding RNAs (lncRNAs) serve as crucial regulators in numerous biological processes.
Deep learning-based approaches have been introduced to classify lncRNAs.
LoRA-BERT is designed to capture the importance of nucleotide-level information during sequence classification.
arXiv Detail & Related papers (2024-11-11T22:17:01Z)
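The name points to low-rank adaptation (LoRA); a minimal sketch of that
technique, with hypothetical shapes, freezes a pretrained linear layer and
learns only a low-rank update:

```python
# Minimal LoRA wrapper: the base layer is frozen; only the low-rank
# factors A and B are trained. Shapes are hypothetical.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                # B starts at zero, so the
                                                 # update starts at zero
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))     # gradients flow only through A and B
```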
- Training Compute-Optimal Protein Language Models [48.79416103951816]
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
arXiv Detail & Related papers (2024-11-04T14:58:37Z)
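A study like this typically fits a parametric scaling law to
(parameters, tokens, loss) measurements. A hedged sketch using a
Chinchilla-style functional form and synthetic data (not the paper's actual
fit or values):

```python
# Fit L(N, D) = E + A / N^alpha + B / D^beta to synthetic measurements.
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# model sizes and token counts spanning the ranges mentioned above
N = np.array([3.5e6, 3.5e7, 1e8, 1e9, 5e9, 1.07e10])
D = np.array([5e9, 1e10, 2e10, 1e11, 1.5e11, 2e11])
true = (1.7, 520.0, 0.34, 1330.0, 0.28)          # made-up "true" constants
L = loss_surface((N, D), *true)                  # synthetic final losses

popt, _ = curve_fit(loss_surface, (N, D), L,
                    p0=[1.5, 300.0, 0.3, 1000.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = popt
print(f"L(N, D) = {E:.2f} + {A:.0f}/N^{alpha:.2f} + {B:.0f}/D^{beta:.2f}")
```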
- Latent Diffusion Models for Controllable RNA Sequence Generation [33.38594748558547]
RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three-dimensional structures.
We develop a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths.
Empirical results confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological metrics.
arXiv Detail & Related papers (2024-09-15T19:04:50Z)
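The core training mechanic of a latent diffusion model is noising encoder
latents and learning to predict the noise; a generic sketch of one training
step (the schedule, denoiser, and dimensions are placeholders, not this
paper's):

```python
# Generic DDPM-style training step in a latent space.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha-bar_t

denoiser = nn.Sequential(                        # toy noise predictor
    nn.Linear(32 + 1, 128), nn.SiLU(), nn.Linear(128, 32)
)

z0 = torch.randn(8, 32)                  # latents from an encoder (stand-in)
t = torch.randint(0, T, (8,))
eps = torch.randn_like(z0)
ab = alphas_bar[t].unsqueeze(-1)
zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps      # forward noising q(z_t | z_0)

t_feat = (t.float() / T).unsqueeze(-1)           # crude timestep conditioning
pred = denoiser(torch.cat([zt, t_feat], dim=-1))
loss = nn.functional.mse_loss(pred, eps)         # predict the added noise
loss.backward()
```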
- mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design [0.4999814847776097]
This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec.
In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec.
mRNA2vec demonstrates significant improvements in translation efficiency (TE) and expression level (EL) prediction tasks.
arXiv Detail & Related papers (2024-08-16T23:23:40Z)
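A hedged sketch of the data2vec-style teacher-student objective that mRNA2vec
builds on: the student sees a masked sequence and regresses masked-position
representations onto those of an EMA teacher that sees the clean input
(architecture details are placeholders, not mRNA2vec's):

```python
# Simplified data2vec-style objective; the real method regresses onto
# averaged top-layer teacher representations.
import copy
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Embedding(6, 64),
                        nn.TransformerEncoderLayer(64, 4, batch_first=True))
student = encoder
teacher = copy.deepcopy(encoder)
for p in teacher.parameters():
    p.requires_grad_(False)

MASK_ID = 5
tokens = torch.randint(0, 4, (8, 50))                  # A/C/G/U token ids
masked = tokens.clone()
mask = torch.rand_like(tokens, dtype=torch.float) < 0.15
masked[mask] = MASK_ID

with torch.no_grad():
    target = teacher(tokens)                           # teacher sees clean input
pred = student(masked)                                 # student sees masked input
loss = nn.functional.mse_loss(pred[mask], target[mask])
loss.backward()

tau = 0.999                                            # EMA update of the teacher
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)
```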
- BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark, BEACON (BEnchmArk for COmprehensive RNA Tasks and Language Models).
First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.
Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.
Third, we investigate the vital components of RNA language models.
arXiv Detail & Related papers (2024-06-14T19:39:19Z)
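In spirit, such a benchmark is a registry of tasks, each pairing data with a
metric, evaluated uniformly across models; a toy sketch with placeholder tasks
and a trivial stand-in model (not BEACON's actual suite):

```python
# Toy benchmark harness: task registry + uniform evaluation loop.
from typing import Callable, Dict, List, Tuple

Metric = Callable[[List[float], List[float]], float]

def mse(preds, labels):
    return sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(preds)

# task -> (examples, metric); a real suite would load 13 datasets
TASKS: Dict[str, Tuple[List[Tuple[str, float]], Metric]] = {
    "secondary_structure_toy": ([("GGGAAACCC", 1.0), ("AUAUAU", 0.0)], mse),
    "translation_efficiency_toy": ([("GGCACC", 0.8), ("AUGGUA", 0.2)], mse),
}

def evaluate(model: Callable[[str], float]) -> Dict[str, float]:
    scores = {}
    for name, (examples, metric) in TASKS.items():
        preds = [model(seq) for seq, _ in examples]
        labels = [label for _, label in examples]
        scores[name] = metric(preds, labels)
    return scores

baseline = lambda seq: seq.count("G") / len(seq)   # trivial stand-in model
print(evaluate(baseline))
```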
- Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs, where a lightweight structural adapter is implanted into the pLM and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z)
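A minimal sketch of the adapter idea above, with illustrative shapes and
fusion (not LM-Design's exact surgery): a small trainable module injects
structure features into frozen pLM hidden states:

```python
# Lightweight structural adapter over frozen pLM hidden states (stand-in).
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    def __init__(self, d_model=512, d_struct=16, bottleneck=64):
        super().__init__()
        self.proj_struct = nn.Linear(d_struct, d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, h, s):      # h: pLM hidden states, s: structure features
        fused = h + self.proj_struct(s)
        return h + self.up(self.act(self.down(fused)))   # residual adapter

plm_hidden = torch.randn(2, 100, 512)    # frozen pLM output (stand-in)
struct_feats = torch.randn(2, 100, 16)   # e.g. per-residue backbone geometry
adapter = StructuralAdapter()
out = adapter(plm_hidden, struct_feats)  # only adapter params would be trained
```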
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
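A hedged sketch of input reprogramming in this spirit: each protein token's
embedding is a learned mixture over a frozen "English" embedding dictionary,
leaving the pretrained model itself untouched (sizes are toy values, and this
is not R2DL's actual dictionary-learning objective):

```python
# Input reprogramming: protein tokens -> mixtures over a frozen dictionary.
import torch
import torch.nn as nn

english_vocab, d_model = 5000, 768   # toy dictionary size (BERT's is ~30k)
protein_vocab = 25                   # 20 amino acids + special tokens

english_emb = nn.Embedding(english_vocab, d_model)
english_emb.weight.requires_grad_(False)             # frozen dictionary

# learned mixing coefficients: protein token -> weights over English tokens
coeffs = nn.Parameter(torch.randn(protein_vocab, english_vocab) * 0.01)

def protein_embed(token_ids: torch.Tensor) -> torch.Tensor:
    weights = coeffs.softmax(dim=-1)                 # (protein_vocab, english_vocab)
    return weights[token_ids] @ english_emb.weight   # (batch, length, d_model)

ids = torch.randint(0, protein_vocab, (4, 64))
emb = protein_embed(ids)       # would be fed into the frozen pretrained LM
```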
- Accurate RNA 3D structure prediction using a language model-based deep learning approach [50.193512039121984]
RhoFold+ is an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences.
RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction.
arXiv Detail & Related papers (2022-07-04T17:15:35Z)
- ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing [2.747785739760799]
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP.
Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids.
For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information.
arXiv Detail & Related papers (2020-07-13T07:54:20Z)
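Per-residue embeddings from the public ProtT5 encoder can be extracted with
the commonly documented Hugging Face pattern (model name and usage follow the
Rostlab release; requires transformers and sentencepiece):

```python
# Extract per-residue ProtT5 embeddings for a protein sequence.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
spaced = " ".join(re.sub(r"[UZOB]", "X", seq))   # map rare residues to X
batch = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state    # (1, len+1, 1024)
per_residue = hidden[0, : len(seq)]              # drop the trailing </s> token
```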
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.