Predictive models of RNA degradation through dual crowdsourcing
- URL: http://arxiv.org/abs/2110.07531v1
- Date: Thu, 14 Oct 2021 16:50:37 GMT
- Title: Predictive models of RNA degradation through dual crowdsourcing
- Authors: Hannah K. Wayment-Steele, Wipapat Kladwang, Andrew M. Watkins, Do Soon
Kim, Bojan Tunguz, Walter Reade, Maggie Temkin, Jonathan Romano, Roger
Wellington-Oguri, John J. Nicol, Jiayang Gao, Kazuki Onodera, Kazuki
Fujikawa, Hanfei Mao, Gilles Vandewiele, Michele Tinti, Bram Steenwinckel,
Takuya Ito, Taiga Noumi, Shujun He, Keiichiro Ishi, Youhan Lee, Fatih
Öztürk, Anthony Chiu, Emin Öztürk, Karim Amer, Mohamed Fares, Eterna
Participants, Rhiju Das
- Abstract summary: We describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle.
Winning models demonstrated test set errors that were better by 50% than the previous state-of-the-art DegScore model.
- Score: 2.003083111563343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Messenger RNA-based medicines hold immense potential, as evidenced by their
rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA
molecules has been limited by their thermostability, which is fundamentally
limited by the intrinsic instability of RNA molecules to a chemical degradation
reaction called in-line hydrolysis. Predicting the degradation of an RNA
molecule is a key task in designing more stable RNA-based therapeutics. Here,
we describe a crowdsourced machine learning competition ("Stanford
OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on
6043 102-130-nucleotide diverse RNA constructs that were themselves solicited
through crowdsourcing on the RNA design platform Eterna. The entire experiment
was completed in less than 6 months. Winning models demonstrated test set
errors that were better by 50% than the previous state-of-the-art DegScore
model. Furthermore, these models generalized to blindly predicting orthogonal
degradation data on much longer mRNA molecules (504-1588 nucleotides) with
improved accuracy over DegScore and other models. Top teams integrated natural
language processing architectures and data augmentation techniques with
predictions from previous dynamic programming models for RNA secondary
structure. These results indicate that such models are capable of representing
in-line hydrolysis with excellent accuracy, supporting their use for designing
stabilized messenger RNAs. The integration of two crowdsourcing platforms, one
for data set creation and another for machine learning, may be fruitful for
other urgent problems that demand scientific discovery on rapid timescales.
Related papers
- Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models [0.0]
Understanding and predicting RNA behavior is a challenge due to the complexity of RNA structures and interactions.
Current RNA models have yet to match the performance observed in the protein domain.
ChaRNABERT is able to reach state-of-the-art performance on several tasks in established benchmarks.
arXiv Detail & Related papers (2024-11-05T21:56:16Z)
- Predicting Distance matrix with large language models [1.8855270809505869]
RNA structure prediction remains a significant challenge due to data limitations.
Traditional methods such as nuclear magnetic resonance spectroscopy, X-ray crystallography, and electron microscopy are expensive and time-consuming.
Distance maps provide a simplified representation of spatial constraints between nucleotides, capturing essential relationships without requiring a full 3D model.
arXiv Detail & Related papers (2024-09-24T10:28:55Z)
- BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models).
First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.
Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.
Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z)
- Regressor-free Molecule Generation to Support Drug Response Prediction [83.25894107956735]
Conditional generation based on the target IC50 score can obtain a more effective sampling space.
Regressor-free guidance combines a diffusion model's score estimation with a regression controller model's gradient based on number labels.
arXiv Detail & Related papers (2024-05-23T13:22:17Z)
- RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks [1.1764999317813143]
We introduce RiboNucleic Acid Language Model (RiNALMo) to unveil the hidden code of RNA.
RiNALMo is the largest RNA language model to date, with 650M parameters pre-trained on 36M non-coding RNA sequences.
arXiv Detail & Related papers (2024-02-29T14:50:58Z)
- Machine Learning Modeling Of SiRNA Structure-Potency Relationship With Applications Against Sars-Cov-2 Spike Gene [0.0]
The drug discovery process is lengthy and costly, taking nearly a decade to bring a new drug to the market.
Biotechnology, computational methods, and machine learning algorithms have the potential to revolutionize drug discovery, speeding up the process and improving patient outcomes.
The COVID-19 pandemic has further accelerated and deepened the recognition of the potential of these techniques, especially in the areas of drug repurposing and efficacy predictions.
arXiv Detail & Related papers (2024-01-18T23:00:34Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
arXiv Detail & Related papers (2023-10-04T10:30:08Z)
- RDesign: Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design [65.41144149958208]
This study aims to systematically construct a data-driven RNA design pipeline.
We crafted a benchmark dataset and designed a comprehensive structural modeling approach to represent the complex RNA tertiary structure.
We incorporated extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process.
arXiv Detail & Related papers (2023-01-25T17:19:49Z)
- E2Efold-3D: End-to-End Deep Learning Method for accurate de novo RNA 3D Structure Prediction [46.38735421190187]
We develop the first end-to-end deep learning approach, E2Efold-3D, to accurately perform de novo RNA structure prediction.
Several novel components are proposed to overcome the data scarcity, such as a fully-differentiable end-to-end pipeline, secondary structure-assisted self-distillation, and parameter-efficient backbone formulation.
arXiv Detail & Related papers (2022-07-04T17:15:35Z)
- Predicting Hydroxyl Mediated Nucleophilic Degradation and Molecular Stability of RNA Sequences through the Application of Deep Learning Methods [0.0]
This paper proposes and evaluates three deep learning models as methods to predict the reactivity and risk of degradation of mRNA sequences.
The Stanford Open Vaccine dataset of 6034 mRNA sequences was used in this study.
Results suggest these models can be applied to understand and predict the chemical stability of mRNA in the near future.
arXiv Detail & Related papers (2020-11-09T10:42:53Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.