Using Signal Processing in Tandem With Adapted Mixture Models for
Classifying Genomic Signals
- URL: http://arxiv.org/abs/2211.01603v1
- Date: Thu, 3 Nov 2022 06:10:55 GMT
- Authors: Saish Jaiswal, Shreya Nema, Hema A Murthy, Manikandan Narayanan
- Abstract summary: We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence.
Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy.
- Score: 16.119729980200955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Genomic signal processing has been used successfully in bioinformatics to
analyze biomolecular sequences and gain varied insights into DNA structure,
gene organization, protein binding, sequence evolution, etc. But challenges
remain in finding the appropriate spectral representation of a biomolecular
sequence, especially when multiple variable-length sequences need to be handled
consistently. In this study, we address this challenge in the context of the
well-studied problem of classifying genomic sequences into different taxonomic
units (strain, phylum, order, etc.). We propose a novel technique that employs
signal processing in tandem with Gaussian mixture models to improve the
spectral representation of a sequence and subsequently the taxonomic
classification accuracy. The sequences are first transformed into spectra
and then projected to a subspace in which sequences belonging to different taxa
are more easily distinguished. Our method outperforms a similar state-of-the-art
method on established benchmark datasets by an absolute margin of 6.06%
accuracy.
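The first step the abstract describes, turning variable-length sequences into spectra that can be compared consistently, can be illustrated with a minimal sketch. This is not the authors' implementation: the purine/pyrimidine mapping, the function names `to_signal` and `magnitude_spectrum`, and the `num_bins` parameter are all assumptions made for illustration, and the subspace projection and Gaussian mixture modelling stages are not shown.

```python
import cmath
import math

# Hypothetical numeric mapping (an assumption, not taken from the paper):
# purines (A, G) -> +1, pyrimidines (C, T) -> -1.
NUCLEOTIDE_VALUE = {"A": 1.0, "G": 1.0, "C": -1.0, "T": -1.0}


def to_signal(sequence):
    """Map a DNA string to a list of floats, skipping unknown symbols."""
    return [NUCLEOTIDE_VALUE[b] for b in sequence.upper() if b in NUCLEOTIDE_VALUE]


def magnitude_spectrum(signal, num_bins=16):
    """Evaluate the DFT magnitude at num_bins evenly spaced frequencies in [0, 0.5).

    Sampling a fixed frequency grid gives every sequence a vector of the
    same length, regardless of how many bases it contains.
    """
    n = len(signal)
    if n == 0:
        return [0.0] * num_bins
    spectrum = []
    for k in range(num_bins):
        freq = 0.5 * k / num_bins  # normalised frequency in cycles per base
        total = sum(x * cmath.exp(-2j * math.pi * freq * t)
                    for t, x in enumerate(signal))
        spectrum.append(abs(total) / n)  # divide by length so scales are comparable
    return spectrum


# Two sequences of different lengths yield feature vectors of equal length.
short_vec = magnitude_spectrum(to_signal("ACGTACGT"))
long_vec = magnitude_spectrum(to_signal("ACGT" * 50))
print(len(short_vec), len(long_vec))  # prints "16 16"
```

Evaluating the transform on a fixed frequency grid is one simple way to obtain equal-length vectors from unequal-length inputs; the paper may handle variable lengths differently.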
Related papers
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- DNA Sequence Classification with Compressors [0.0]
Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis.
Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods.
arXiv Detail & Related papers (2024-01-25T09:17:19Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
- Multiscale methods for signal selection in single-cell data [2.683475550237718]
We propose three topologically-motivated mathematical methods for unsupervised feature selection.
We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets.
arXiv Detail & Related papers (2022-06-15T18:42:26Z)
- EvoVGM: A Deep Variational Generative Model for Evolutionary Parameter Estimation [0.0]
We propose a method for a deep variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters.
We show the consistency and effectiveness of the method on synthetic sequence alignments simulated with several evolutionary scenarios and on a real virus sequence alignment.
arXiv Detail & Related papers (2022-05-25T20:08:10Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- A Novel Granular-Based Bi-Clustering Method of Deep Mining the Co-Expressed Genes [76.84066556597342]
Bi-clustering methods are used to mine bi-clusters whose subsets of samples (genes) are co-regulated under their test conditions.
Unfortunately, traditional bi-clustering methods are not fully effective in discovering such bi-clusters.
We propose a novel bi-clustering method based on the theory of Granular Computing.
arXiv Detail & Related papers (2020-05-12T02:04:40Z)