Resource saving taxonomy classification with k-mer distributions and machine learning
- URL: http://arxiv.org/abs/2303.06154v1
- Date: Fri, 10 Mar 2023 08:01:08 GMT
- Title: Resource saving taxonomy classification with k-mer distributions and machine learning
- Authors: Wolfgang Fuhl, Susanne Zabel, Kay Nieselt
- Abstract summary: We propose to use $k$-mer distributions obtained from DNA as features to classify its taxonomic origin.
We show that our approach improves the classification on the genus level and achieves comparable results for the superkingdom and phylum level.
- Score: 2.0196229393131726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern high throughput sequencing technologies like metagenomic sequencing
generate millions of sequences which have to be classified based on their
taxonomic rank. Modern approaches either apply local alignment and comparison
against existing data sets, as MMseqs2 does, or use deep neural networks, as
DeepMicrobes and BERTax do. Alignment-based approaches are costly in terms of
runtime, especially as databases keep growing. Deep learning-based approaches
require specialized hardware, whose computations consume large amounts of
energy. In this paper, we propose to use
$k$-mer distributions obtained from DNA as features to classify its taxonomic
origin using machine learning approaches like the subspace $k$-nearest
neighbors algorithm, neural networks, or bagged decision trees. In addition, we
propose a feature-space data set balancing approach, which reduces the training
data set while improving classifier performance. By
comparing performance, time, and memory consumption of our approach to those of
state-of-the-art algorithms (BERTax and MMseqs2) using several datasets, we
show that our approach improves the classification on the genus level and
achieves comparable results for the superkingdom and phylum level.
- Link: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTaxonomyClassification&mode=list
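As a rough illustration of the core idea (a minimal sketch, not the authors' implementation), the code below turns a DNA sequence into a normalized k-mer frequency vector and feeds it to scikit-learn's plain k-nearest-neighbors classifier as a stand-in for the subspace k-NN named in the abstract; all function and variable names are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def kmer_distribution(seq: str, k: int = 4) -> np.ndarray:
    """Normalized k-mer frequency vector of a DNA sequence (4**k entries)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing N or other ambiguity codes
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total else counts

# Toy training data; real labels would come from a reference database.
train = [("ACGTACGTGGCC" * 10, "genus_a"), ("TTTTGGGGACAC" * 10, "genus_b")]
X = np.array([kmer_distribution(s) for s, _ in train])
y = [label for _, label in train]

clf = KNeighborsClassifier(n_neighbors=1)  # stand-in for the subspace k-NN
clf.fit(X, y)
print(clf.predict([kmer_distribution("ACGTACGTGGCCACGT")]))
```

Because the feature vector has fixed length 4**k regardless of sequence length, any off-the-shelf classifier can consume it, which is what keeps the approach cheap compared to alignment or transformer inference.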
Related papers
- Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($k$-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size.
Results on many real-world datasets indicate that the new adaptive $k$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
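A hedged sketch of the general adaptive-k idea follows; it uses a crude distance-ratio heuristic rather than the paper's shape-operator estimate, and all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_knn_predict(X_train, y_train, x, k_min=3, k_max=15):
    """Majority vote over a per-query neighborhood size chosen from local geometry."""
    nn = NearestNeighbors(n_neighbors=k_max).fit(X_train)
    dist, idx = nn.kneighbors([x])
    # Heuristic locality proxy (NOT the paper's curvature estimate): shrink k
    # where the nearest distances grow quickly relative to the k_max-th one.
    spread = dist[0, k_min - 1] / (dist[0, -1] + 1e-12)
    k = int(k_min + (k_max - k_min) * spread)
    votes = np.asarray(y_train)[idx[0, :k]]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```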
arXiv Detail & Related papers (2024-09-08T13:08:45Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of improvements to the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Towards Meta-learned Algorithm Selection using Implicit Fidelity Information [13.750624267664156]
IMFAS produces informative landmarks, easily enriched by arbitrary meta-features at a low computational cost.
We show it is able to beat Successive Halving with at most half the fidelity sequence during test time.
arXiv Detail & Related papers (2022-06-07T09:14:24Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings [1.7188280334580195]
This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved.
The proposed algorithm has the same complexity as the original $t$-SNE to embed new items, and a lower one when considering the embedding of a dataset sliced into sub-pieces.
arXiv Detail & Related papers (2021-09-22T06:45:37Z)
- Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network [0.0]
This paper describes the second-best submission in the competition.
Our meta-learning approach modifies the distribution of each class in the latent space produced by a backbone network.
For this task, we utilize optimal transport mapping using the Sinkhorn algorithm.
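Since the Sinkhorn algorithm itself is standard, here is a textbook sketch of entropic-regularized optimal transport between two histograms; this is the iteration the authors build on, not their exact pipeline.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iter=200):
    """Entropic-regularized OT plan between histograms a and b."""
    K = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # scale columns to match b
        u = a / (K @ v)    # scale rows to match a
    return u[:, None] * K * v[None, :]

a = np.array([0.5, 0.5])
b = np.array([0.25, 0.75])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
plan = sinkhorn(a, b, cost)
print(plan.sum(axis=1), plan.sum(axis=0))  # rows ~ a, columns ~ b
```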
arXiv Detail & Related papers (2021-02-09T23:10:58Z)
- Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to two key sub-tasks of a MIP solver: generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones, and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
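As a hedged reference point for the ADMM machinery (the paper's coded, stochastic, mini-batch variant adds considerably more), here is plain consensus ADMM on a toy least-squares problem split across agents; all names are illustrative.

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, n_iter=100):
    """Plain consensus ADMM for minimizing sum_i ||A_i x - b_i||^2 over a shared x."""
    n, d = len(As), As[0].shape[1]
    x = np.zeros((n, d))
    u = np.zeros((n, d))
    z = np.zeros(d)
    for _ in range(n_iter):
        for i in range(n):
            # Local step: argmin ||A_i x - b_i||^2 + (rho/2) ||x - z + u_i||^2
            lhs = 2 * As[i].T @ As[i] + rho * np.eye(d)
            rhs = 2 * As[i].T @ bs[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(lhs, rhs)
        z = (x + u).mean(axis=0)  # consensus (averaging) step
        u += x - z                # scaled dual update
    return z

rng = np.random.default_rng(0)
As = [rng.normal(size=(20, 3)) for _ in range(4)]
x_true = np.array([1.0, -2.0, 0.5])
bs = [A @ x_true for A in As]
print(consensus_admm(As, bs))  # converges toward x_true
```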
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Imbalance Learning for Variable Star Classification [0.0]
We develop a hierarchical machine learning classification scheme to overcome imbalanced learning problems.
We use 'data-level' approaches to directly augment the training data so that they better describe under-represented classes.
We find that a higher classification rate is obtained when using $\texttt{GpFit}$ in the hierarchical model.
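For the 'data-level' augmentation idea, a minimal sketch of random oversampling to class balance is shown below; the paper's actual augmentation is more sophisticated, and the helper name is illustrative.

```python
import numpy as np

def oversample(X, y, seed=0):
    """Randomly oversample minority classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=target, replace=True)
         for c in classes]
    )
    return X[idx], y[idx]
```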
arXiv Detail & Related papers (2020-02-27T19:01:05Z)
- Scalable End-to-end Recurrent Neural Network for Variable star classification [1.2722697496405464]
We propose an end-to-end algorithm that automatically learns the representation of light curves that allows an accurate automatic classification.
Our method uses minimal data preprocessing, can be updated with a low computational cost for new observations and light curves, and can scale up to massive datasets.
arXiv Detail & Related papers (2020-02-03T19:56:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.