Linear normalised hash function for clustering gene sequences and
identifying reference sequences from multiple sequence alignments
- URL: http://arxiv.org/abs/2311.17964v1
- Date: Wed, 29 Nov 2023 11:51:05 GMT
- Title: Linear normalised hash function for clustering gene sequences and
identifying reference sequences from multiple sequence alignments
- Authors: Manal Helal, Fanrong Kong, Sharon C-A Chen, Fei Zhou, Dominic E Dwyer,
John Potter, Vitali Sintchenko
- Abstract summary: A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed.
The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences.
- Score: 4.34040512215583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The aim of this study was to develop a method that would identify the cluster
centroids and the optimal number of clusters for a given sensitivity level and
could work equally well for the different sequence datasets. A novel method
that combines the linear mapping hash function and multiple sequence alignment
(MSA) was developed. The method takes advantage of the sequences already sorted
by similarity in the MSA output and identifies the optimal number of clusters,
the cluster cut-offs, and the cluster centroids that can serve as reference gene
vouchers for the different species. The linear mapping hash function maps a
distance matrix that is already ordered by similarity onto indices, revealing
gaps in the values around which the optimal cut-offs between the different
clusters can be identified. The method was evaluated using sets of closely
related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1
genomic region of Enterovirus 71) sequences and outperformed existing
unsupervised machine learning clustering methods and dimensionality reduction
methods. The method does not require prior knowledge of the number of clusters
or of the distances between clusters, handles clusters of different sizes and
shapes, and scales linearly with the size of the dataset. Combining MSA with the
linear mapping hash function is a computationally efficient approach to gene
sequence clustering and can be a valuable tool for assessing similarity,
clustering different microbial genomes, identifying reference sequences, and
studying the evolution of bacteria and viruses.
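The core idea can be sketched in a few lines. The following Python snippet is a minimal illustration, not the authors' implementation: it assumes a 1-D array of distances already ordered by similarity (as the MSA output would provide), and the bucket count `n_buckets`, gap threshold `min_gap`, and helper names are hypothetical choices made for illustration. It linearly normalises the distances onto integer indices, cuts clusters where consecutive indices jump by more than the threshold, and picks each cluster's centroid as the member with the smallest mean within-cluster distance.

```python
import numpy as np

def linear_hash_clusters(ordered_dists, n_buckets=100, min_gap=1):
    """Sketch: map similarity-ordered distances to integer bucket indices
    with a linear (min-max) normalisation, then cut clusters wherever
    neighbouring sequences fall into buckets separated by a large gap.

    ordered_dists : distances of each sequence to a reference (e.g. the
                    first sequence), in MSA/similarity order.
    Returns a list of (start, end) index ranges, one per cluster.
    """
    d = np.asarray(ordered_dists, dtype=float)
    lo, hi = d.min(), d.max()
    # Linear mapping hash: normalise each distance into [0, n_buckets].
    idx = np.round((d - lo) / (hi - lo + 1e-12) * n_buckets).astype(int)
    # A jump larger than `min_gap` between neighbouring (already ordered)
    # sequences marks a candidate cluster cut-off.
    cuts = np.where(np.diff(idx) > min_gap)[0] + 1
    bounds = [0, *cuts.tolist(), len(d)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def cluster_centroid(dist_matrix, members):
    """Pick the member with the smallest mean distance to the rest of the
    cluster as its centroid / candidate reference sequence."""
    sub = dist_matrix[np.ix_(members, members)]
    return members[int(np.argmin(sub.mean(axis=1)))]

# Toy usage: 8 sequences already ordered by similarity, two obvious groups.
dists = np.array([0.00, 0.01, 0.02, 0.03, 0.40, 0.41, 0.43, 0.44])
print(linear_hash_clusters(dists, n_buckets=100, min_gap=5))
# expected: [(0, 4), (4, 8)]
```

In this sketch the cut-off detection is a single linear pass over the ordered distances, which mirrors the linear scaling with dataset size claimed in the abstract.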
Related papers
- HBIC: A Biclustering Algorithm for Heterogeneous Datasets [0.0]
Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix.
We introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data.
arXiv Detail & Related papers (2024-08-23T16:48:10Z)
- Adaptive Graph Convolutional Subspace Clustering [10.766537212211217]
Spectral-type subspace clustering algorithms have shown excellent performance in many subspace clustering applications.
In this paper, inspired by graph convolutional networks, we use the graph convolution technique to develop a feature extraction method and a coefficient matrix constraint simultaneously.
We claim that by using AGCSC, the aggregated feature representation of original data samples is suitable for subspace clustering.
arXiv Detail & Related papers (2023-05-05T10:27:23Z)
- Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Clustering Ensemble Meets Low-rank Tensor Approximation [50.21581880045667]
This paper explores the problem of clustering ensemble, which aims to combine multiple base clusterings to produce better performance than that of any individual clustering.
We propose a novel low-rank tensor approximation-based method to solve the problem from a global perspective.
Experimental results over 7 benchmark data sets show that the proposed model achieves a breakthrough in clustering performance, compared with 12 state-of-the-art methods.
arXiv Detail & Related papers (2020-12-16T13:01:37Z)
- Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix [57.11971786407279]
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data.
This paper proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix.
Our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of the linear combination of both the first-order and high-order base.
arXiv Detail & Related papers (2020-08-31T12:28:40Z)
- A Novel Granular-Based Bi-Clustering Method of Deep Mining the Co-Expressed Genes [76.84066556597342]
Bi-clustering methods are used to mine bi-clusters whose subsets of samples (genes) are co-regulated under their test conditions.
Unfortunately, traditional bi-clustering methods are not fully effective in discovering such bi-clusters.
We propose a novel bi-clustering method by involving here the theory of Granular Computing.
arXiv Detail & Related papers (2020-05-12T02:04:40Z)
- Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns.
We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using L1 dissimilarity, with hierarchical clustering, and a version of k-means: partitioning around medoids or PAM.
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.