A novel RNA pseudouridine site prediction model using Utility Kernel and
data-driven parameters
- URL: http://arxiv.org/abs/2311.16132v1
- Date: Thu, 2 Nov 2023 08:32:10 GMT
- Title: A novel RNA pseudouridine site prediction model using Utility Kernel and
data-driven parameters
- Authors: Sourabh Patil, Archana Mathur, Raviprasad Aduri, Snehanshu Saha
- Abstract summary: Pseudouridine is the most frequent modification in RNA.
Existing models to predict the pseudouridine sites in a given RNA sequence mainly depend on user-defined features.
We propose a Support Vector Machine (SVM) Kernel based on utility theory from Economics.
- Score: 0.7373617024876725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: RNA-protein interactions (RPIs) play an important role in biological systems.
Recently, we have enumerated the RPIs at the residue level and have elucidated
the minimum structural unit (MSU) in these interactions to be a stretch of five
residues (nucleotides/amino acids). Pseudouridine is the most frequent
modification in RNA. The conversion of uridine to pseudouridine involves
interactions between pseudouridine synthase and RNA. The existing models to
predict the pseudouridine sites in a given RNA sequence mainly depend on
user-defined features such as mono and dinucleotide composition/propensities of
RNA sequences. Predicting pseudouridine sites is a non-linear classification
problem with limited data points. Deep Learning models are efficient
discriminators when the data set size is reasonably large and fail when there
is a paucity of data ($<1000$ samples). To mitigate this problem, we propose a
Support Vector Machine (SVM) Kernel based on utility theory from Economics, and
using data-driven parameters (i.e. MSU) as features. For this purpose, we have
used position-specific tri/quad/pentanucleotide composition/propensity
(PSPC/PSPP) besides nucleotide and dinucleotide composition as features. SVMs
are known to work well in small data regimes and kernels in SVM are designed to
classify non-linear data. The proposed model outperforms the existing
state-of-the-art models significantly (10%-15% on average).
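
The feature types named in the abstract (k-mer composition and position-specific k-mer propensity, for k up to the five-residue MSU) can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the propensity below is assumed to be the per-position difference between positive-class and negative-class k-mer frequencies, one common convention. All function names and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # RNA nucleotides


def kmer_composition(seq, k):
    """Normalized frequency of every possible k-mer in seq (4**k values)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[m] / total for m in kmers]


def position_specific_freq(seqs, k):
    """For each window start, the fraction of sequences showing each k-mer there.

    Assumes all sequences have the same length (fixed window around the site).
    """
    n_pos = len(seqs[0]) - k + 1
    tables = [Counter() for _ in range(n_pos)]
    for s in seqs:
        for i in range(n_pos):
            tables[i][s[i:i + k]] += 1
    return [{m: c / len(seqs) for m, c in t.items()} for t in tables]


def propensity_features(seq, pos_freq, neg_freq, k):
    """Assumed propensity: positive-class minus negative-class frequency
    of the k-mer observed at each position of seq."""
    return [
        pos_freq[i].get(seq[i:i + k], 0.0) - neg_freq[i].get(seq[i:i + k], 0.0)
        for i in range(len(pos_freq))
    ]


# Toy, made-up sequences (not from the paper's data sets).
positives = ["ACGUUAC", "ACGUUGC"]
negatives = ["GGAUUCC", "UCAUUAG"]
pos_f = position_specific_freq(positives, 3)
neg_f = position_specific_freq(negatives, 3)
features = propensity_features("ACGUUAC", pos_f, neg_f, 3)
print(features)
```

Vectors built this way could then be passed to an SVM. scikit-learn's `SVC`, for example, accepts a callable as its `kernel` argument, which is one way a custom kernel might be plugged in. The utility-theory kernel itself is not specified in the abstract and is not reproduced here.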
Related papers
- BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models).
First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.
Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.
Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
arXiv Detail & Related papers (2023-10-04T10:30:08Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Optirank: classification for RNA-Seq data with optimal ranking reference genes [0.0]
We propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking.
We also consider real classification tasks, which present different kinds of distribution shifts between train and test data.
arXiv Detail & Related papers (2023-01-11T10:49:06Z)
- MuCoMiD: A Multitask Convolutional Learning Framework for miRNA-Disease Association Prediction [0.4061135251278187]
We propose a novel multi-tasking convolution-based approach, which we refer to as MuCoMiD.
MuCoMiD allows automatic feature extraction while incorporating knowledge from 4 heterogeneous biological information sources.
We construct large-scale experiments on standard benchmark datasets as well as our proposed larger independent test sets and case studies.
MuCoMiD shows an improvement of at least 5% in 5-fold CV evaluation on HMDDv2.0 and HMDDv3.0 datasets and at least 49% on larger independent test sets with unseen diseases over state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-08T10:01:46Z)
- Machine learning for plant microRNA prediction: A systematic review [0.0]
MicroRNAs (miRNAs) are endogenous small non-coding RNAs that play an important role in gene regulation.
Computational and machine learning-based approaches have been adopted to predict microRNAs.
This systematic review focuses on the machine learning methods developed for identification in plants.
arXiv Detail & Related papers (2021-06-29T08:22:57Z)
- Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters.
First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
- Approximate kNN Classification for Biomedical Data [1.1852406625172218]
Single-cell RNA-seq (scRNA-seq) is an emerging DNA sequencing technology with promising capabilities but significant computational challenges.
We propose the utilization of approximate nearest neighbor search algorithms for the task of kNN classification in scRNA-seq data.
arXiv Detail & Related papers (2020-12-03T18:30:43Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.