Related papers: Nearest Neighbor CCP-Based Molecular Sequence Analysis

URL: http://arxiv.org/abs/2409.04922v1
Date: Sat, 7 Sep 2024 22:06:00 GMT
Title: Nearest Neighbor CCP-Based Molecular Sequence Analysis
Authors: Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson
Abstract summary: Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. We present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. Our findings show that CCP-NN considerably improves classification accuracy and significantly outperforms CCP in terms of computational runtime.
Score: 4.199844472131922
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality reduction and feature selection approaches. Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. Although CCP is effective for sequence visualization, it is still costly to compute. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. To group related molecular sequences and produce representative supersequences, CCP makes use of sequence-to-sequence correlations. Unlike conventional methods, CCP does not rely on matrix diagonalization, so it can be applied to a range of machine-learning problems. We estimate the density map and compute the correlation using a nearest-neighbor search technique. To assess the efficacy of our proposed approach, we performed molecular sequence classification using CCP and CCP-NN representations. Our findings show that CCP-NN considerably improves classification accuracy and significantly outperforms CCP in terms of computational runtime.
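
Illustrative sketch (not the authors' implementation): the abstract describes estimating a density and computing correlations through a nearest-neighbor search. The Python below shows one way such a projection could look, under several assumptions that do not come from the paper: sequences are already encoded as numeric vectors, the feature groups are supplied by hand rather than learned by correlation clustering, the kernel is Gaussian, and neighbors are found by brute force. The names knn_projection, rows, groups, k and scale are hypothetical.

```python
import math


def knn_projection(rows, feature_groups, k=2, scale=1.0):
    """Map each sample to one value per feature group.

    The value is the sum of Gaussian kernel weights between the sample
    and its k nearest neighbors, with distances measured only on the
    columns of that group. Assumed form, for illustration only.
    """
    projected = []
    for i, row in enumerate(rows):
        new_row = []
        for group in feature_groups:
            # Distances to every other sample, restricted to this group.
            dists = sorted(
                math.dist([row[c] for c in group], [other[c] for c in group])
                for j, other in enumerate(rows)
                if j != i
            )
            # Keep only the k closest samples (brute-force search).
            new_row.append(sum(math.exp(-((d / scale) ** 2)) for d in dists[:k]))
        projected.append(new_row)
    return projected


# Toy data: three encoded sequences with four features each.
rows = [[0, 0, 1, 1], [0, 0, 1, 1], [5, 5, 9, 9]]
groups = [[0, 1], [2, 3]]  # hypothetical feature groups
embedding = knn_projection(rows, groups, k=1)
print(embedding)
```

Each sample ends up with one coordinate per feature group, so the output dimension equals the number of groups. Restricting the kernel sum to k neighbors is what would shorten the projection step, provided the neighbor search itself is done with an index structure rather than the brute-force scan used here.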

Related papers

K*-Means: A Parameter-free Clustering Algorithm [55.20132267309382]
k*-means is a novel clustering algorithm that eliminates the need to set k or any other parameters. It uses the minimum description length principle to automatically determine the optimal number of clusters, k*, by splitting and merging clusters. We prove that k*-means is guaranteed to converge and demonstrate experimentally that it significantly outperforms existing methods in scenarios where k is unknown.
arXiv Detail & Related papers (2025-05-17T08:41:07Z)
Survey on Algorithms for multi-index models [45.143425167349314]
We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity.
arXiv Detail & Related papers (2025-04-07T18:50:11Z)
Graph Canonical Correlation Analysis [2.588462392029118]
Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of variables. Recent advancements in CCA methods have expanded their application to decipher the interactions of multiomics datasets. We propose the graph Canonical Correlation Analysis (gCCA) approach, which calculates canonical correlations based on the graph structure of the cross-correlation matrix.
arXiv Detail & Related papers (2025-02-03T19:41:06Z)
K-Nearest-Neighbors Induced Topological PCA for scRNA Sequence Data Analysis [0.3683202928838613]
We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and $L_{2,1}$ norm regularization. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse scRNA-seq datasets.
arXiv Detail & Related papers (2023-10-23T03:07:50Z)
Unconstrained Stochastic CCA: Unifying Multiview and Self-Supervised Learning [0.13654846342364307]
We present a family of fast algorithms for PLS, CCA, and Deep CCA on all standard CCA and Deep CCA benchmarks. Our algorithms show far faster convergence and recover higher correlations than the previous state-of-the-art benchmarks. These improvements allow us to perform a first-of-its-kind PLS analysis of an extremely large biomedical dataset.
arXiv Detail & Related papers (2023-10-02T09:03:59Z)
Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
The sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs). This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models. In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE [0.0]
Correlated clustering and projection (CCP) was introduced as an effective method for preprocessing scRNA-seq data. CCP is a data-domain approach that does not require matrix diagonalization. By using eight publicly available datasets, we have found that CCP significantly improves UMAP and t-SNE visualization.
arXiv Detail & Related papers (2023-06-23T19:15:43Z)
HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations [3.3934198248179026]
Hyperdimensional Computing (HDC) is a proposed learning paradigm that is able to leverage low-precision binary vector arithmetic. We show that HDC-based inference methods are as much as 90 times more efficient than more complex representative machine learning methods.
arXiv Detail & Related papers (2023-03-27T21:21:46Z)
Fast conformational clustering of extensive molecular dynamics simulation data [19.444636864515726]
We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long trajectories. We combine two dimensionality reduction algorithms (cc_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN). With the help of four test systems we illustrate the capability and performance of this clustering workflow.
arXiv Detail & Related papers (2023-01-11T14:36:43Z)
Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences. Exact methods yield better classification performance, but they pose high computational costs. We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
Correlation Clustering in Constant Many Parallel Rounds [42.10280805559555]
Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. Our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds.
arXiv Detail & Related papers (2021-06-15T21:45:45Z)
Exact Optimization of Conformal Predictors via Incremental and Decremental Learning [46.9970555048259]
Conformal Predictors (CP) are wrappers around ML methods, providing error guarantees under weak assumptions on the data distribution. They are suitable for a wide range of problems, from classification and regression to anomaly detection. We show that it is possible to speed up a CP classifier considerably, by studying it in conjunction with the underlying ML method, and by exploiting incremental and decremental learning.
arXiv Detail & Related papers (2021-02-05T15:31:37Z)
Progressive Spatio-Temporal Graph Convolutional Network for Skeleton-Based Human Action Recognition [97.14064057840089]
We propose a method to automatically find a compact and problem-specific network for graph convolutional networks in a progressive manner. Experimental results on two datasets for skeleton-based human action recognition indicate that the proposed method has competitive or even better classification performance.
arXiv Detail & Related papers (2020-11-11T09:57:49Z)
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.