CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data
- URL: http://arxiv.org/abs/2502.00380v2
- Date: Wed, 02 Apr 2025 19:10:01 GMT
- Title: CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data
- Authors: Bruno Belucci, Karim Lounici, Katia Meziani
- Abstract summary: We propose Consensus Hierarchical Random Feature (CoHiRF), a novel clustering method designed to address the challenges of clustering high-dimensional data. CoHiRF leverages random feature selection to mitigate noise and dimensionality effects, repeatedly applies K-Means clustering in reduced feature spaces, and combines results through a unanimous consensus criterion. CoHiRF is computationally efficient with a running time comparable to K-Means, scalable to massive datasets, and exhibits robust performance against state-of-the-art methods such as SC-SRGF, HDBSCAN, and OPTICS.
- Score: 0.30723404270319693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering high-dimensional data poses significant challenges due to the curse of dimensionality, scalability issues, and the presence of noisy and irrelevant features. We propose Consensus Hierarchical Random Feature (CoHiRF), a novel clustering method designed to address these challenges effectively. CoHiRF leverages random feature selection to mitigate noise and dimensionality effects, repeatedly applies K-Means clustering in reduced feature spaces, and combines results through a unanimous consensus criterion. This iterative approach constructs a cluster assignment matrix, where each row records the cluster assignments of a sample across repetitions, enabling the identification of stable clusters by comparing identical rows. Clusters are organized hierarchically, enabling the interpretation of the hierarchy to gain insights into the dataset. CoHiRF is computationally efficient with a running time comparable to K-Means, scalable to massive datasets, and exhibits robust performance against state-of-the-art methods such as SC-SRGF, HDBSCAN, and OPTICS. Experimental results on synthetic and real-world datasets confirm the method's ability to reveal meaningful patterns while maintaining scalability, making it a powerful tool for high-dimensional data analysis.
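The consensus step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans_labels` and `consensus_step`, the parameters `n_repetitions`, `n_features`, and `k`, and the plain Lloyd's K-Means are all hypothetical choices made for this example. It covers a single consensus level only; how the method builds its hierarchy on top of this step is not specified in the abstract and is not shown.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=20):
    """Plain Lloyd's K-Means (illustrative stand-in); returns one label per row."""
    # Fancy indexing copies, so updating `centers` never modifies X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances from every sample to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels


def consensus_step(X, n_repetitions=5, n_features=2, k=2, seed=0):
    """One consensus level: repeated K-Means on random feature subsets.

    Returns the (n_samples, n_repetitions) cluster assignment matrix and a
    consensus label per sample. Two samples share a consensus label only if
    their rows in the assignment matrix are identical (unanimous agreement).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    assignments = np.empty((n, n_repetitions), dtype=int)
    for r in range(n_repetitions):
        # Random feature selection: cluster in a reduced feature space.
        cols = rng.choice(p, size=min(n_features, p), replace=False)
        assignments[:, r] = kmeans_labels(X[:, cols], k, rng)
    # Identical rows of the assignment matrix map to the same consensus label.
    _, consensus = np.unique(assignments, axis=0, return_inverse=True)
    return assignments, consensus.reshape(-1)
```

Calling `consensus_step(X)` on an array of shape `(n_samples, n_features_total)` returns the assignment matrix together with the consensus labels. Because agreement must be unanimous, the number of consensus clusters can exceed `k`: any sample pair that is separated in even one repetition ends up in different consensus clusters.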
Related papers
- Line Space Clustering (LSC): Feature-Based Clustering using K-medians and Dynamic Time Warping for Versatility [0.0]
Line Space Clustering (LSC) is a representation that transforms data points into lines in a newly defined feature space.
LSC employs a combined distance metric that uses Euclidean and Dynamic Time Warping (DTW) distances, weighted by a parameter alpha.
Experiments demonstrate the efficacy of LSC on synthetic and real-world datasets.
arXiv Detail & Related papers (2025-03-20T01:27:10Z)
- Graph Probability Aggregation Clustering [5.377020739388736]
We propose a graph-based fuzzy clustering algorithm that unifies the global clustering objective function with a local clustering constraint.
The entire GPAC framework is formulated as a multi-constrained optimization problem, which can be solved using the Lagrangian method.
Experiments conducted on synthetic, real-world, and deep learning datasets demonstrate that GPAC not only exceeds existing state-of-the-art methods in clustering performance but also excels in computational efficiency.
arXiv Detail & Related papers (2025-02-27T09:11:32Z)
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Hierarchical Sparse Representation Clustering for High-Dimensional Data Streams [16.228652652243888]
We propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams.
The experimental results obtained on several benchmark datasets demonstrate the effectiveness and robustness of HSRC.
arXiv Detail & Related papers (2024-09-07T03:40:55Z)
- Adaptive Self-supervised Robust Clustering for Unstructured Data with Unknown Cluster Number [12.926206811876174]
We introduce a novel self-supervised deep clustering approach tailored for unstructured data, termed Adaptive Self-supervised Robust Clustering (ASRC).
ASRC adaptively learns the graph structure and edge weights to capture both local and global structural information.
ASRC even outperforms methods that rely on prior knowledge of the number of clusters, highlighting its effectiveness in addressing the challenges of clustering unstructured data.
arXiv Detail & Related papers (2024-07-29T15:51:09Z)
- Fuzzy K-Means Clustering without Cluster Centroids [21.256564324236333]
Fuzzy K-Means clustering is a critical technique in unsupervised data analysis.
This paper proposes a novel Fuzzy K-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
arXiv Detail & Related papers (2024-04-07T12:25:03Z)
- Deep Embedding Clustering Driven by Sample Stability [16.53706617383543]
We propose a deep embedding clustering algorithm driven by sample stability (DECS).
Specifically, we start by constructing the initial feature space with an autoencoder and then learn the cluster-oriented embedding feature constrained by sample stability.
The experimental results on five datasets illustrate that the proposed method achieves superior performance compared to state-of-the-art clustering approaches.
arXiv Detail & Related papers (2024-01-29T09:19:49Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Enhancing cluster analysis via topological manifold learning [0.3823356975862006]
We show that inferring the topological structure of a dataset before clustering can considerably enhance cluster detection.
We combine the manifold learning method UMAP, used to infer the topological structure, with the density-based clustering method DBSCAN.
arXiv Detail & Related papers (2022-07-01T15:53:39Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Stable and consistent density-based clustering via multiparameter persistence [77.34726150561087]
We consider the degree-Rips construction from topological data analysis.
We analyze its stability to perturbations of the input data using the correspondence-interleaving distance.
We integrate these methods into a pipeline for density-based clustering, which we call Persistable.
arXiv Detail & Related papers (2020-05-18T19:45:04Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)