Comparison of machine learning and deep learning techniques in promoter
prediction across diverse species
- URL: http://arxiv.org/abs/2105.07659v1
- Date: Mon, 17 May 2021 08:15:41 GMT
- Title: Comparison of machine learning and deep learning techniques in promoter
prediction across diverse species
- Authors: Nikita Bhandari, Satyajeet Khare, Rahee Walambe, Ketan Kotecha
- Abstract summary: We studied methods for vector encoding and promoter classification using genome sequences of three higher eukaryotes, namely yeast, A. thaliana and human.
We found CNN to be superior in classification of promoters from non-promoter sequences (binary classification) as well as species-specific classification of promoter sequences (multiclass classification).
- Score: 1.8899300124593648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gene promoters are the key DNA regulatory elements positioned around the
transcription start sites and are responsible for regulating gene transcription
process. Various alignment-based, signal-based and content-based approaches are
reported for the prediction of promoters. However, since all promoter sequences
do not show explicit features, the prediction performance of these techniques
is poor. Therefore, many machine learning and deep learning models have been
proposed for promoter prediction. In this work, we studied methods for vector
encoding and promoter classification using genome sequences of three distinct
higher eukaryotes, namely yeast (Saccharomyces cerevisiae), A. thaliana (a
plant) and human (Homo sapiens). We compared the one-hot vector encoding method with
frequency-based tokenization (FBT) for data pre-processing on 1-D Convolutional
Neural Network (CNN) model. We found that FBT gives a shorter input dimension,
reducing the training time without affecting the sensitivity and specificity of
classification. We employed deep learning techniques, namely a CNN and a
recurrent neural network with Long Short-Term Memory (LSTM), together with a
random forest (RF) classifier, for promoter classification at k-mer sizes of 2,
4 and 8. We
found CNN to be superior in classification of promoters from non-promoter
sequences (binary classification) as well as species-specific classification of
promoter sequences (multiclass classification). In summary, the contribution of
this work lies in the use of synthetic shuffled negative dataset and
frequency-based tokenization for pre-processing. This study provides a
comprehensive and generic framework for classification tasks in genomic
applications and can be extended to various classification problems.
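The abstract contrasts one-hot encoding with frequency-based k-mer tokenization and mentions a synthetic shuffled negative dataset. The paper gives no code, so the following is only a minimal sketch of those three ideas; all function names are hypothetical, and the exact shuffling scheme used by the authors (e.g. whether dinucleotide frequencies are preserved) is not specified here, so a simple base-level shuffle is assumed.

```python
import random
from itertools import product

def one_hot(seq):
    """One-hot encode a DNA sequence: each base becomes a 4-dim vector,
    so a sequence of length L becomes an L x 4 input."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table[b] for b in seq]

def kmer_tokens(seq, k):
    """Frequency-based tokenization: count overlapping k-mers, giving a
    fixed-length 4**k vector, shorter than one-hot for long sequences."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(vocab)
    for i in range(len(seq) - k + 1):
        counts[vocab[seq[i:i + k]]] += 1
    return counts

def shuffled_negative(seq, seed=0):
    """Synthetic negative example: shuffle the promoter's bases so base
    composition is preserved but positional promoter signal is destroyed."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

promoter = "TATAAAGGC"
print(len(one_hot(promoter)))            # L positions, 4 channels each
print(len(kmer_tokens(promoter, k=2)))   # 4**2 = 16 dimensions
print(sum(kmer_tokens(promoter, k=2)))   # L - k + 1 overlapping 2-mers
print(sorted(shuffled_negative(promoter)) == sorted(promoter))
```

For a 9-base sequence, one-hot gives a 9 x 4 input, whereas 2-mer counting collapses it to a single 16-dimensional frequency vector, which is consistent with the abstract's point that FBT shortens the input dimension and reduces training time.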
Related papers
- Fixed Random Classifier Rearrangement for Continual Learning [0.5439020425819]
In visual classification scenario, neural networks inevitably forget the knowledge of old tasks after learning new ones.
We propose a continual learning algorithm named Fixed Random Classifier Rearrangement (FRCR).
arXiv Detail & Related papers (2024-02-23T09:43:58Z)
- Neural networks for insurance pricing with frequency and severity data: a benchmark study from data preprocessing to technical tariff [2.4578723416255754]
We present a benchmark study on four insurance data sets with frequency and severity targets in the presence of multiple types of input features.
We compare in detail the performance of a generalized linear model on binned input data, a gradient-boosted tree model, a feed-forward neural network (FFNN), and the combined actuarial neural network (CANN)
arXiv Detail & Related papers (2023-10-19T12:00:33Z)
- Class Binarization to NeuroEvolution for Multiclass Classification [9.179400849826216]
Multiclass classification is a fundamental and challenging task in machine learning.
Decomposing multiclass classification into a set of binary classifications is called class binarization.
We propose a new method that applies Error-Correcting Output Codes (ECOC) to design the class binarization strategies on the neuroevolution for multiclass classification.
arXiv Detail & Related papers (2023-08-26T13:26:13Z)
- Optirank: classification for RNA-Seq data with optimal ranking reference genes [0.0]
We propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking.
We also consider real classification tasks, which present different kinds of distribution shifts between train and test data.
arXiv Detail & Related papers (2023-01-11T10:49:06Z)
- Domain Adaptive Nuclei Instance Segmentation and Classification via Category-aware Feature Alignment and Pseudo-labelling [65.40672505658213]
We propose a novel deep neural network, namely Category-Aware feature alignment and Pseudo-Labelling Network (CAPL-Net) for UDA nuclei instance segmentation and classification.
Our approach outperforms state-of-the-art UDA methods with a remarkable margin.
arXiv Detail & Related papers (2022-07-04T07:05:06Z)
- Do We Really Need a Learnable Classifier at the End of Deep Neural Network? [118.18554882199676]
We study the potential of learning a neural network for classification with the classifier randomly initialized as an ETF and fixed during training.
Our experimental results show that our method is able to achieve similar performances on image classification for balanced datasets.
arXiv Detail & Related papers (2022-03-17T04:34:28Z)
- Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and WebVision datasets, observing that Prototypical obtains substantial improvements compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-10-22T01:55:01Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- ECINN: Efficient Counterfactuals from Invertible Neural Networks [80.94500245955591]
We propose a method, ECINN, that utilizes the generative capacities of invertible neural networks for image classification to generate counterfactual examples efficiently.
ECINN has a closed-form expression and generates a counterfactual in the time of only two evaluations.
Our experiments demonstrate how ECINN alters class-dependent image regions to change the perceptual and predicted class of the counterfactuals.
arXiv Detail & Related papers (2021-03-25T09:23:24Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found RNA-seq features to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
- Robust Classification of High-Dimensional Spectroscopy Data Using Deep Learning and Data Synthesis [0.5801044612920815]
A novel application of a locally-connected neural network (NN) for the binary classification of spectroscopy data is proposed.
A two-step classification process is presented as an alternative to the binary and one-class classification paradigms.
arXiv Detail & Related papers (2020-03-26T11:33:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.