Latent Diffusion Models for Controllable RNA Sequence Generation
- URL: http://arxiv.org/abs/2409.09828v2
- Date: Wed, 2 Oct 2024 16:42:46 GMT
- Title: Latent Diffusion Models for Controllable RNA Sequence Generation
- Authors: Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
- Abstract summary: RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three-dimensional structures.
We develop a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths.
Empirical results confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological metrics.
- Score: 33.38594748558547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. RNA is a key intermediary between DNA and protein, exhibiting high sequence diversity and complex three-dimensional structures to support a wide range of functions. We utilize pretrained BERT-type models to encode raw RNA sequences into token-level, biologically meaningful representations. A Query Transformer is employed to compress such representations into a set of fixed-length latent vectors, with an autoregressive decoder trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we integrate the gradients of reward models (surrogates for RNA functional properties) into the backward diffusion process, thereby generating RNAs with high reward scores. Empirical results confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological metrics. Further, we fine-tune the diffusion model on mRNA 5' untranslated regions (5'-UTRs) and optimize sequences for high translation efficiencies. Our guided diffusion model effectively generates diverse 5'-UTRs with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), outperforming baselines in balancing the trade-off between reward and structural stability. Our findings hold potential for advancing RNA sequence-function research and therapeutic RNA design.
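The reward-guided backward diffusion described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the closed-form denoiser `toy_eps_model` (valid only for standard-normal latents), the quadratic `reward`, the noise schedule, and every function and parameter name here are assumptions made for illustration. In the paper's setting, the denoiser and the reward models are learned networks acting on the Query Transformer latents.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.2):
    # Linear variance-preserving noise schedule (hypothetical values).
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_eps_model(z, alpha_bar):
    # Stand-in for a learned denoiser. If the clean latents are N(0, I),
    # every noisy marginal is also N(0, I), so the score is -z and the
    # matching noise prediction is sqrt(1 - alpha_bar) * z.
    return np.sqrt(1.0 - alpha_bar) * z

def reward(z, target):
    # Hypothetical reward surrogate: higher when z is closer to target.
    return -0.5 * np.sum((z - target) ** 2, axis=-1)

def reward_grad(z, target):
    # Analytic gradient of the toy reward with respect to z.
    return target - z

def guided_sample(n, dim, target, guidance_scale, rng, num_steps=50):
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal((n, dim))  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = toy_eps_model(z, alpha_bars[t])
        # Classifier-guidance-style correction: shift the noise prediction
        # so that the implied score gains guidance_scale * grad(reward).
        eps = eps - guidance_scale * np.sqrt(1.0 - alpha_bars[t]) * reward_grad(z, target)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.full(4, 2.0)
    plain = guided_sample(512, 4, target, guidance_scale=0.0, rng=rng)
    guided = guided_sample(512, 4, target, guidance_scale=1.0, rng=rng)
    print("mean reward without guidance:", reward(plain, target).mean())
    print("mean reward with guidance:   ", reward(guided, target).mean())
```

With `guidance_scale=0.0` the sampler reduces to plain ancestral sampling from the toy prior. Raising the scale pulls samples toward high-reward regions and reduces sample variance, a simplified analogue of the trade-off between reward and other properties that the abstract mentions.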
Related papers
- Comprehensive benchmarking of large language models for RNA secondary structure prediction [0.0]
RNA-LLM uses large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector.
Among them, predicting the secondary structure is a fundamental task for uncovering RNA functional mechanisms.
We present a comprehensive experimental analysis of several pre-trained RNA-LLM, comparing them for the RNA secondary structure prediction task in a unified deep learning framework.
arXiv Detail & Related papers (2024-10-21T17:12:06Z)
- Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design [56.957070405026194]
We propose an algorithm that enables direct backpropagation of rewards through entire trajectories generated by diffusion models.
DRAKES can generate sequences that are both natural-like and yield high rewards.
arXiv Detail & Related papers (2024-10-17T15:10:13Z)
- RNACG: A Universal RNA Sequence Conditional Generation model based on Flow-Matching [0.0]
We develop a universal RNA sequence generation model based on flow matching, namely RNACG.
RNACG can accommodate various conditional inputs and is portable, enabling users to customize the encoding network for conditional inputs.
RNACG exhibits extensive applicability in sequence generation and property prediction tasks.
arXiv Detail & Related papers (2024-07-29T09:46:46Z)
- BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models).
First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.
Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.
Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z)
- RNAFlow: RNA Structure & Sequence Design via Inverse Folding-Based Flow Matching [7.600990806121113]
RNAFlow is a flow matching model for protein-conditioned RNA sequence-structure design.
Its denoising network integrates an RNA inverse folding model and a pre-trained RosettaFold2NA network for generation of RNA sequences and structures.
arXiv Detail & Related papers (2024-05-29T05:10:25Z)
- scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling [9.013834280011293]
Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research.
Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT).
This method generates virtual scRNA-seq data by leveraging a real dataset.
arXiv Detail & Related papers (2024-04-09T09:25:16Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
arXiv Detail & Related papers (2023-10-04T10:30:08Z)
- RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design [65.41144149958208]
This study aims to systematically construct a data-driven RNA design pipeline.
We crafted a benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure.
We incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process.
arXiv Detail & Related papers (2023-01-25T17:19:49Z)
- Improving RNA Secondary Structure Design using Deep Reinforcement Learning [69.63971634605797]
We propose a new benchmark of applying reinforcement learning to RNA sequence design, in which the objective function is defined to be the free energy in the sequence's secondary structure.
We show results of the ablation analysis that we do for these algorithms, as well as graphs indicating the algorithm's performance across batches.
arXiv Detail & Related papers (2021-11-05T02:54:06Z)
- Classification of Long Noncoding RNA Elements Using Deep Convolutional Neural Networks and Siamese Networks [17.8181080354116]
This thesis proposes a new method employing deep convolutional neural networks (CNNs) to classify ncRNA sequences.
As a result, classifying RNA sequences is converted to an image classification problem that can be efficiently solved by CNN-based classification models.
arXiv Detail & Related papers (2021-02-10T17:26:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.