Comparison of machine learning and deep learning techniques in promoter
prediction across diverse species
- URL: http://arxiv.org/abs/2105.07659v1
- Date: Mon, 17 May 2021 08:15:41 GMT
- Title: Comparison of machine learning and deep learning techniques in promoter
prediction across diverse species
- Authors: Nikita Bhandari, Satyajeet Khare, Rahee Walambe, Ketan Kotecha
- Abstract summary: We studied methods for vector encoding and promoter classification using genome sequences of three higher eukaryotes, namely yeast, A. thaliana and human.
We found CNN to be superior in classification of promoters from non-promoter sequences (binary classification) as well as species-specific classification of promoter sequences (multiclass classification).
- Score: 1.8899300124593648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gene promoters are the key DNA regulatory elements positioned around the
transcription start sites and are responsible for regulating gene transcription
process. Various alignment-based, signal-based and content-based approaches are
reported for the prediction of promoters. However, since all promoter sequences
do not show explicit features, the prediction performance of these techniques
is poor. Therefore, many machine learning and deep learning models have been
proposed for promoter prediction. In this work, we studied methods for vector
encoding and promoter classification using genome sequences of three distinct
higher eukaryotes, namely yeast (Saccharomyces cerevisiae), A. thaliana (a
plant) and human (Homo sapiens). We compared the one-hot vector encoding method with
frequency-based tokenization (FBT) for data pre-processing on 1-D Convolutional
Neural Network (CNN) model. We found that FBT gives a shorter input dimension,
reducing the training time without affecting the sensitivity and specificity of
classification. We employed deep learning techniques, namely a CNN and a
recurrent neural network with Long Short-Term Memory (LSTM), together with a
random forest (RF) classifier, for promoter classification at k-mer sizes of 2,
4 and 8. We
found CNN to be superior in classification of promoters from non-promoter
sequences (binary classification) as well as species-specific classification of
promoter sequences (multiclass classification). In summary, the contribution of
this work lies in the use of synthetic shuffled negative dataset and
frequency-based tokenization for pre-processing. This study provides a
comprehensive and generic framework for classification tasks in genomic
applications and can be extended to various classification problems.
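The abstract contrasts one-hot encoding with frequency-based k-mer tokenization and mentions a synthetic shuffled negative dataset. The paper gives no code, so the following is only a minimal sketch of those three ideas; all function names are hypothetical, and the exact shuffling scheme used by the authors (e.g. whether dinucleotide frequencies are preserved) is not specified here, so a simple base-level shuffle is assumed.

```python
import random
from itertools import product

def one_hot(seq):
    """One-hot encode a DNA sequence: each base becomes a 4-dim vector,
    so a sequence of length L becomes an L x 4 input."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table[b] for b in seq]

def kmer_tokens(seq, k):
    """Frequency-based tokenization: count overlapping k-mers, giving a
    fixed-length 4**k vector, shorter than one-hot for long sequences."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(vocab)
    for i in range(len(seq) - k + 1):
        counts[vocab[seq[i:i + k]]] += 1
    return counts

def shuffled_negative(seq, seed=0):
    """Synthetic negative example: shuffle the promoter's bases so base
    composition is preserved but positional promoter signal is destroyed."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

promoter = "TATAAAGGC"
print(len(one_hot(promoter)))            # L positions, 4 channels each
print(len(kmer_tokens(promoter, k=2)))   # 4**2 = 16 dimensions
print(sum(kmer_tokens(promoter, k=2)))   # L - k + 1 overlapping 2-mers
print(sorted(shuffled_negative(promoter)) == sorted(promoter))
```

For a 9-base sequence, one-hot gives a 9 x 4 input, whereas 2-mer counting collapses it to a single 16-dimensional frequency vector, which is consistent with the abstract's point that FBT shortens the input dimension and reduces training time.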
Related papers
- Fixed Random Classifier Rearrangement for Continual Learning [0.5439020425819]
In visual classification scenario, neural networks inevitably forget the knowledge of old tasks after learning new ones.
We propose a continual learning algorithm named Fixed Random Classifier Rearrangement (FRCR).
arXiv Detail & Related papers (2024-02-23T09:43:58Z)
- Neural networks for insurance pricing with frequency and severity data: a benchmark study from data preprocessing to technical tariff [2.4578723416255754]
We present a benchmark study on four insurance data sets with frequency and severity targets in the presence of multiple types of input features.
We compare in detail the performance of a generalized linear model on binned input data, a gradient-boosted tree model, a feed-forward neural network (FFNN), and the combined actuarial neural network (CANN)
arXiv Detail & Related papers (2023-10-19T12:00:33Z)
- Class Binarization to NeuroEvolution for Multiclass Classification [9.179400849826216]
Multiclass classification is a fundamental and challenging task in machine learning.
Decomposing multiclass classification into a set of binary classifications is called class binarization.
We propose a new method that applies Error-Correcting Output Codes (ECOC) to design the class binarization strategies on the neuroevolution for multiclass classification.
arXiv Detail & Related papers (2023-08-26T13:26:13Z)
- Optirank: classification for RNA-Seq data with optimal ranking reference genes [0.0]
We propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking.
We also consider real classification tasks, which present different kinds of distribution shifts between train and test data.
arXiv Detail & Related papers (2023-01-11T10:49:06Z)
- Domain Adaptive Nuclei Instance Segmentation and Classification via Category-aware Feature Alignment and Pseudo-labelling [65.40672505658213]
We propose a novel deep neural network, namely Category-Aware feature alignment and Pseudo-Labelling Network (CAPL-Net) for UDA nuclei instance segmentation and classification.
Our approach outperforms state-of-the-art UDA methods with a remarkable margin.
arXiv Detail & Related papers (2022-07-04T07:05:06Z)
- Do We Really Need a Learnable Classifier at the End of Deep Neural Network? [118.18554882199676]
We study the potential of learning a neural network for classification with the classifier randomly initialized as an ETF and fixed during training.
Our experimental results show that our method is able to achieve similar performances on image classification for balanced datasets.
arXiv Detail & Related papers (2022-03-17T04:34:28Z)
- Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and WebVision datasets, observing that Prototypical obtains substantial improvements compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-10-22T01:55:01Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- ECINN: Efficient Counterfactuals from Invertible Neural Networks [80.94500245955591]
We propose a method, ECINN, that utilizes the generative capacities of invertible neural networks for image classification to generate counterfactual examples efficiently.
ECINN has a closed-form expression and generates a counterfactual in the time of only two evaluations.
Our experiments demonstrate how ECINN alters class-dependent image regions to change the perceptual and predicted class of the counterfactuals.
arXiv Detail & Related papers (2021-03-25T09:23:24Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found RNA-seq features to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
- Robust Classification of High-Dimensional Spectroscopy Data Using Deep Learning and Data Synthesis [0.5801044612920815]
A novel application of a locally-connected neural network (NN) for the binary classification of spectroscopy data is proposed.
A two-step classification process is presented as an alternative to the binary and one-class classification paradigms.
arXiv Detail & Related papers (2020-03-26T11:33:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.