BioSequence2Vec: Efficient Embedding Generation For Biological Sequences
- URL: http://arxiv.org/abs/2304.00291v1
- Date: Sat, 1 Apr 2023 10:58:21 GMT
- Title: BioSequence2Vec: Efficient Embedding Generation For Biological Sequences
- Authors: Sarwan Ali, Usama Sardar, Murray Patterson, Imdad Ullah Khan
- Abstract summary: We propose a general-purpose representation learning approach that embodies kernel methods' qualities while avoiding computation, memory, and generalizability challenges.
Our proposed fast and alignment-free embedding method can be used as input to any distance-based or non-distance-based ML method.
We perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.
- Score: 1.0896567381206714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representation learning is an important step in the machine learning
pipeline. Given the current biological sequencing data volume, learning an
explicit representation is prohibitive due to the dimensionality of the
resulting feature vectors. Kernel-based methods, e.g., SVM, have proven to be an
efficient and useful alternative for several machine learning (ML) tasks such
as sequence classification. Three challenges with kernel methods are (i) the
computation time, (ii) the memory usage (storing an $n\times n$ matrix), and
(iii) the restriction of kernel matrices to kernel-based ML methods (they are
difficult to generalize to non-kernel classifiers). While (i) can be solved
using approximate methods, challenge (ii) remains for typical kernel methods.
Similarly, although non-kernel-based ML methods can be applied to kernel
matrices by extracting principal components (kernel PCA), this may result in
information loss while remaining computationally expensive. In this paper, we
propose a general-purpose representation learning approach that embodies kernel
methods' qualities while avoiding computation, memory, and generalizability
challenges. This involves computing a low-dimensional embedding of each
sequence, using random projections of its $k$-mer frequency vectors,
significantly reducing the computation needed to compute the dot product and
the memory needed to store the resulting representation. Our proposed fast and
alignment-free embedding method can be used as input to any distance (e.g., $k$
nearest neighbors) and non-distance (e.g., decision tree) based ML method for
classification and clustering tasks. Using different forms of biological
sequences as input, we perform a variety of real-world classification tasks,
such as SARS-CoV-2 lineage and gene family classification, outperforming
several state-of-the-art embedding and kernel methods in predictive
performance.
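To make the idea concrete, below is a minimal Python/NumPy sketch of the kind of embedding the abstract describes: each sequence's $k$-mer frequency vector is mapped to a low-dimensional vector by a random projection, so dot products between embeddings approximate the $k$-mer spectrum kernel at a fraction of the memory. The alphabet, $k$, embedding dimension, and Gaussian projection are illustrative assumptions, not the authors' exact scheme.

```python
# A minimal sketch of the embedding idea described above: map each sequence's
# k-mer frequency vector to a low-dimensional vector via a random projection.
# The alphabet, k, embedding dimension, and Gaussian projection below are
# illustrative assumptions, not the authors' exact scheme.
from itertools import product

import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide alphabet (proteins would use 20 letters)
K = 4              # assumed k-mer length
DIM = 64           # assumed embedding dimension (much smaller than 4**K = 256)

# Fixed coordinate for every possible k-mer so all sequences share a basis.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}

# Dense Gaussian random projection (4**K x DIM). By the Johnson-Lindenstrauss
# lemma, dot products of projected vectors approximate dot products of the
# original k-mer frequency vectors, i.e., the k-mer spectrum kernel.
rng = np.random.default_rng(0)
PROJECTION = rng.normal(0.0, 1.0 / np.sqrt(DIM), size=(len(KMER_INDEX), DIM))


def kmer_frequency_vector(seq: str) -> np.ndarray:
    """Count occurrences of each k-mer in seq (alignment-free)."""
    counts = np.zeros(len(KMER_INDEX))
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i : i + K])
        if idx is not None:  # skip k-mers with ambiguous characters (e.g., N)
            counts[idx] += 1
    return counts


def embed(seq: str) -> np.ndarray:
    """Low-dimensional embedding: random projection of the k-mer spectrum."""
    return kmer_frequency_vector(seq) @ PROJECTION


if __name__ == "__main__":
    seqs = ["ACGTACGTAA", "TTTTACGATT", "ACGTTTGACA"]
    X = np.stack([embed(s) for s in seqs])
    print(X.shape)  # (3, 64): one fixed-length vector per sequence
```

Because each sequence becomes an ordinary fixed-length vector, it can be passed unchanged to distance-based learners (e.g., $k$-nearest neighbors) or non-distance-based ones (e.g., decision trees), and only an $n \times \text{DIM}$ matrix needs to be stored rather than an $n \times n$ kernel matrix.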
Related papers
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with attention mechanism, we can effectively boost performance without huge computational overhead.
We show our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
- Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov Decision Processes (POMDPs)
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU)
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z)
- Quick Adaptive Ternary Segmentation: An Efficient Decoding Procedure For Hidden Markov Models [70.26374282390401]
Decoding the original signal (i.e., hidden chain) from the noisy observations is one of the main goals in nearly all HMM-based data analyses.
We present Quick Adaptive Ternary Segmentation (QATS), a divide-and-conquer procedure which decodes the hidden sequence in polylogarithmic computational complexity.
arXiv Detail & Related papers (2023-05-29T19:37:48Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of improvements to the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Learning "best" kernels from data in Gaussian process regression. With application to aerodynamics [0.4588028371034406]
We introduce algorithms to select/design kernels in Gaussian process regression/kriging surrogate modeling techniques.
A first class of algorithms is kernel flow, which was introduced in the context of classification in machine learning.
A second class of algorithms is called spectral kernel ridge regression, and aims at selecting a "best" kernel such that the norm of the function to be approximated is minimal.
arXiv Detail & Related papers (2022-06-03T07:50:54Z)
- High-Dimensional Sparse Bayesian Learning without Covariance Matrices [66.60078365202867]
We introduce a new inference scheme that avoids explicit construction of the covariance matrix.
Our approach couples a little-known diagonal estimation result from numerical linear algebra with the conjugate gradient algorithm.
On several simulations, our method scales better than existing approaches in computation time and memory.
arXiv Detail & Related papers (2022-02-25T16:35:26Z)
- Learning in High-Dimensional Feature Spaces Using ANOVA-Based Fast Matrix-Vector Multiplication [0.0]
The kernel matrix is typically dense and large-scale. Depending on the dimension of the feature space, even computing all of its entries in reasonable time becomes a challenging task.
We propose the use of an ANOVA kernel, where we construct several kernels based on lower-dimensional feature spaces for which we provide fast algorithms realizing the matrix-vector products.
Based on a feature grouping approach, we then show how the fast matrix-vector products can be embedded into a learning method, choosing kernel ridge regression and the preconditioned conjugate gradient solver.
arXiv Detail & Related papers (2021-11-19T10:29:39Z)
- Tensor Network Kalman Filtering for Large-Scale LS-SVMs [17.36231167296782]
Least squares support vector machines are used for nonlinear regression and classification.
A framework based on tensor networks and the Kalman filter is presented to alleviate the demanding memory and computational complexities.
Results show that our method can achieve high performance and is particularly useful when alternative methods are computationally infeasible.
arXiv Detail & Related papers (2021-10-26T08:54:03Z)
- The Fast Kernel Transform [21.001203328543006]
We propose the Fast Kernel Transform (FKT), a general algorithm to compute matrix-vector multiplications for datasets in moderate dimensions with quasilinear complexity.
The FKT is easily applied to a broad class of kernels, including Gaussian, Matérn, and Rational Quadratic covariance functions and physically motivated Green's functions.
We illustrate the efficacy and versatility of the FKT by providing timing and accuracy benchmarks and by applying it to scale stochastic neighborhood embedding (t-SNE) and Gaussian processes to large real-world data sets.
arXiv Detail & Related papers (2021-06-08T16:15:47Z)
- Spectral Learning on Matrices and Tensors [74.88243719463053]
We show that tensor decomposition can pick up latent effects that are missed by matrix methods.
We also outline computational techniques to design efficient tensor decomposition methods.
arXiv Detail & Related papers (2020-04-16T22:53:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.