Resource saving taxonomy classification with k-mer distributions and machine learning
- URL: http://arxiv.org/abs/2303.06154v1
- Date: Fri, 10 Mar 2023 08:01:08 GMT
- Title: Resource saving taxonomy classification with k-mer distributions and machine learning
- Authors: Wolfgang Fuhl, Susanne Zabel, Kay Nieselt
- Abstract summary: We propose to use $k$-mer distributions obtained from DNA as features to classify its taxonomic origin.
We show that our approach improves the classification on the genus level and achieves comparable results for the superkingdom and phylum level.
- Score: 2.0196229393131726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern high throughput sequencing technologies like metagenomic sequencing
generate millions of sequences which have to be classified based on their
taxonomic rank. Modern approaches either apply local alignment and comparison
against existing data sets, as MMseqs2 does, or use deep neural networks, as
DeepMicrobes and BERTax do. Alignment-based approaches are costly in terms of
runtime, especially as databases keep growing. Deep learning-based approaches
require specialized hardware, whose computations consume large amounts of
energy. In this paper, we propose to use
$k$-mer distributions obtained from DNA as features to classify its taxonomic
origin using machine learning approaches like the subspace $k$-nearest
neighbors algorithm, neural networks, or bagged decision trees. In addition, we
propose a feature-space data set balancing approach, which reduces the training
data set while improving classifier performance. By
comparing performance, time, and memory consumption of our approach to those of
state-of-the-art algorithms (BERTax and MMseqs2) using several datasets, we
show that our approach improves the classification on the genus level and
achieves comparable results for the superkingdom and phylum level.
- Link: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FTaxonomyClassification&mode=list
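As a rough illustration of the core idea (a minimal sketch, not the authors' implementation), the code below turns a DNA sequence into a normalized k-mer frequency vector and feeds it to scikit-learn's plain k-nearest-neighbors classifier as a stand-in for the subspace k-NN named in the abstract; all function and variable names are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def kmer_distribution(seq: str, k: int = 4) -> np.ndarray:
    """Normalized k-mer frequency vector of a DNA sequence (4**k entries)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing N or other ambiguity codes
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total else counts

# Toy training data; real labels would come from a reference database.
train = [("ACGTACGTGGCC" * 10, "genus_a"), ("TTTTGGGGACAC" * 10, "genus_b")]
X = np.array([kmer_distribution(s) for s, _ in train])
y = [label for _, label in train]

clf = KNeighborsClassifier(n_neighbors=1)  # stand-in for the subspace k-NN
clf.fit(X, y)
print(clf.predict([kmer_distribution("ACGTACGTGGCCACGT")]))
```

Because the feature vector has fixed length 4**k regardless of sequence length, any off-the-shelf classifier can consume it, which is what keeps the approach cheap compared to alignment or transformer inference.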
Related papers
- Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($k$-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size.
Results on many real-world datasets indicate that the new adaptive $k$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
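A hedged sketch of the general adaptive-k idea follows; it uses a crude distance-ratio heuristic rather than the paper's shape-operator estimate, and all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_knn_predict(X_train, y_train, x, k_min=3, k_max=15):
    """Majority vote over a per-query neighborhood size chosen from local geometry."""
    nn = NearestNeighbors(n_neighbors=k_max).fit(X_train)
    dist, idx = nn.kneighbors([x])
    # Heuristic locality proxy (NOT the paper's curvature estimate): shrink k
    # where the nearest distances grow quickly relative to the k_max-th one.
    spread = dist[0, k_min - 1] / (dist[0, -1] + 1e-12)
    k = int(k_min + (k_max - k_min) * spread)
    votes = np.asarray(y_train)[idx[0, :k]]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```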
arXiv Detail & Related papers (2024-09-08T13:08:45Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of improvements to the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Towards Meta-learned Algorithm Selection using Implicit Fidelity Information [13.750624267664156]
IMFAS produces informative landmarks, easily enriched by arbitrary meta-features at a low computational cost.
We show it is able to beat Successive Halving with at most half the fidelity sequence during test time.
arXiv Detail & Related papers (2022-06-07T09:14:24Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings [1.7188280334580195]
This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved.
The proposed algorithm has the same complexity as the original $t$-SNE to embed new items, and a lower one when considering the embedding of a dataset sliced into sub-pieces.
arXiv Detail & Related papers (2021-09-22T06:45:37Z)
- Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network [0.0]
This paper describes the second-best submission in the competition.
Our meta-learning approach modifies the distribution of each class in the latent space produced by a backbone network.
For this task, we utilize optimal transport mapping using the Sinkhorn algorithm.
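Since the Sinkhorn algorithm itself is standard, here is a textbook sketch of entropic-regularized optimal transport between two histograms; this is the iteration the authors build on, not their exact pipeline.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iter=200):
    """Entropic-regularized OT plan between histograms a and b."""
    K = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # scale columns to match b
        u = a / (K @ v)    # scale rows to match a
    return u[:, None] * K * v[None, :]

a = np.array([0.5, 0.5])
b = np.array([0.25, 0.75])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
plan = sinkhorn(a, b, cost)
print(plan.sum(axis=1), plan.sum(axis=0))  # rows ~ a, columns ~ b
```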
arXiv Detail & Related papers (2021-02-09T23:10:58Z)
- Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to two key sub-tasks of a MIP solver: generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones, and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
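As a hedged reference point for the ADMM machinery (the paper's coded, stochastic, mini-batch variant adds considerably more), here is plain consensus ADMM on a toy least-squares problem split across agents; all names are illustrative.

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, n_iter=100):
    """Plain consensus ADMM for minimizing sum_i ||A_i x - b_i||^2 over a shared x."""
    n, d = len(As), As[0].shape[1]
    x = np.zeros((n, d))
    u = np.zeros((n, d))
    z = np.zeros(d)
    for _ in range(n_iter):
        for i in range(n):
            # Local step: argmin ||A_i x - b_i||^2 + (rho/2) ||x - z + u_i||^2
            lhs = 2 * As[i].T @ As[i] + rho * np.eye(d)
            rhs = 2 * As[i].T @ bs[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(lhs, rhs)
        z = (x + u).mean(axis=0)  # consensus (averaging) step
        u += x - z                # scaled dual update
    return z

rng = np.random.default_rng(0)
As = [rng.normal(size=(20, 3)) for _ in range(4)]
x_true = np.array([1.0, -2.0, 0.5])
bs = [A @ x_true for A in As]
print(consensus_admm(As, bs))  # converges toward x_true
```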
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Imbalance Learning for Variable Star Classification [0.0]
We develop a hierarchical machine learning classification scheme to overcome imbalanced learning problems.
We use 'data-level' approaches to directly augment the training data so that they better describe under-represented classes.
We find that a higher classification rate is obtained when using $\texttt{GpFit}$ in the hierarchical model.
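For the 'data-level' augmentation idea, a minimal sketch of random oversampling to class balance is shown below; the paper's actual augmentation is more sophisticated, and the helper name is illustrative.

```python
import numpy as np

def oversample(X, y, seed=0):
    """Randomly oversample minority classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=target, replace=True)
         for c in classes]
    )
    return X[idx], y[idx]
```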
arXiv Detail & Related papers (2020-02-27T19:01:05Z)
- Scalable End-to-end Recurrent Neural Network for Variable star classification [1.2722697496405464]
We propose an end-to-end algorithm that automatically learns the representation of light curves that allows an accurate automatic classification.
Our method uses minimal data preprocessing, can be updated with a low computational cost for new observations and light curves, and can scale up to massive datasets.
arXiv Detail & Related papers (2020-02-03T19:56:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.