DNA Sequence Classification with Compressors
- URL: http://arxiv.org/abs/2401.14025v1
- Date: Thu, 25 Jan 2024 09:17:19 GMT
- Title: DNA Sequence Classification with Compressors
- Authors: \c{S}\"ukr\"u Ozan
- Abstract summary: Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis.
Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies in DNA sequence classification have leveraged sophisticated
machine learning techniques, achieving notable accuracy in categorizing complex
genomic data. Among these, methods such as k-mer counting have proven effective
in distinguishing sequences from varied species like chimpanzees, dogs, and
humans, becoming a staple in contemporary genomic research. However, these
approaches often demand extensive computational resources, posing a challenge
in terms of scalability and efficiency. Addressing this issue, our study
introduces a novel adaptation of Jiang et al.'s compressor-based,
parameter-free classification method, specifically tailored for DNA sequence
analysis. This innovative approach utilizes a variety of compression
algorithms, such as Gzip, Brotli, and LZMA, to efficiently process and classify
genomic sequences. Not only does this method align with the current
state-of-the-art in terms of accuracy, but it also offers a more
resource-efficient alternative to traditional machine learning methods. Our
comprehensive evaluation demonstrates the proposed method's effectiveness in
accurately classifying DNA sequences from multiple species. We present a
detailed analysis of the performance of each algorithm used, highlighting the
strengths and limitations of our approach in various genomic contexts.
Furthermore, we discuss the broader implications of our findings for
bioinformatics, particularly in genomic data processing and analysis. The
results of our study pave the way for more efficient and scalable DNA sequence
classification methods, offering significant potential for advancements in
genomic research and applications.
Related papers
- Stacked ensemble\-based mutagenicity prediction model using multiple modalities with graph attention network [0.9736758288065405]
Mutagenicity is a concern due to its association with genetic mutations which can result in a variety of negative consequences.
In this work, we introduce a novel stacked ensemble based mutagenicity prediction model.
arXiv Detail & Related papers (2024-09-03T09:14:21Z) - Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization [16.491060073775884]
We introduce an iterative gene panel selection strategy applicable to clustering tasks in single-cell genomics.
Our method integrates results from other gene selection algorithms, providing valuable preliminary boundaries.
We incorporate the nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization.
arXiv Detail & Related papers (2024-06-11T16:21:33Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Fast and Functional Structured Data Generators Rooted in
Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - Machine Learning Methods for Cancer Classification Using Gene Expression
Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases.
Gene expression can play a fundamental role in the early detection of cancer.
This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z) - Using Signal Processing in Tandem With Adapted Mixture Models for
Classifying Genomic Signals [16.119729980200955]
We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence.
Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy.
arXiv Detail & Related papers (2022-11-03T06:10:55Z) - Deep metric learning improves lab of origin prediction of genetically
engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z) - Improving RNA Secondary Structure Design using Deep Reinforcement
Learning [69.63971634605797]
We propose a new benchmark of applying reinforcement learning to RNA sequence design, in which the objective function is defined to be the free energy in the sequence's secondary structure.
We show results of the ablation analysis that we do for these algorithms, as well as graphs indicating the algorithm's performance across batches.
arXiv Detail & Related papers (2021-11-05T02:54:06Z) - Comparing Machine Learning Algorithms with or without Feature Extraction
for DNA Classification [0.7742297876120561]
Three state-of-the-art algorithms, namely Convolutional Neural Networks, Deep Neural Networks, and N-gram Probabilistic Models, are used for the task of DNA classification.
We introduce a novel feature extraction method based on the Levenshtein distance and randomly generated DNA sub-sequences.
Four different data sets, each concerning viral diseases such as Covid-19, AIDS, Influenza, and Hepatitis C, are used for evaluating the different approaches.
arXiv Detail & Related papers (2020-11-01T12:04:54Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.