Related papers: CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data

CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data

URL: http://arxiv.org/abs/2502.00380v2
Date: Wed, 02 Apr 2025 19:10:01 GMT
Title: CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data
Authors: Bruno Belucci, Karim Lounici, Katia Meziani,
Abstract summary: We propose Consensus Hierarchical Random Feature (CoHiRF), a novel clustering method designed to address challenges effectively.<n>CoHiRF leverages random feature selection to mitigate noise and dimensionality effects, repeatedly applies K-Means clustering in reduced feature spaces, and combines results through a unanimous consensus criterion.<n>CoHiRF is computationally efficient with a running time comparable to K-Means, scalable to massive datasets, and exhibits robust performance against state-of-the-art methods such as SC-SRGF, HDBSCAN, and OPTICS.
Score: 0.30723404270319693
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Clustering high-dimensional data poses significant challenges due to the curse of dimensionality, scalability issues, and the presence of noisy and irrelevant features. We propose Consensus Hierarchical Random Feature (CoHiRF), a novel clustering method designed to address these challenges effectively. CoHiRF leverages random feature selection to mitigate noise and dimensionality effects, repeatedly applies K-Means clustering in reduced feature spaces, and combines results through a unanimous consensus criterion. This iterative approach constructs a cluster assignment matrix, where each row records the cluster assignments of a sample across repetitions, enabling the identification of stable clusters by comparing identical rows. Clusters are organized hierarchically, enabling the interpretation of the hierarchy to gain insights into the dataset. CoHiRF is computationally efficient with a running time comparable to K-Means, scalable to massive datasets, and exhibits robust performance against state-of-the-art methods such as SC-SRGF, HDBSCAN, and OPTICS. Experimental results on synthetic and real-world datasets confirm the method's ability to reveal meaningful patterns while maintaining scalability, making it a powerful tool for high-dimensional data analysis.

Related papers

Sparse clustering via the Deterministic Information Bottleneck algorithm [0.0]
When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face unprecedented challenges.<n>We present an information-theoretic framework that overcomes the problems associated with sparse data, allowing for joint feature weighting and clustering.
arXiv Detail & Related papers (2026-01-28T14:05:44Z)
Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning [47.32771052588132]
The nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data.<n>We propose a Multi-Granular Competitiveization Learning algorithm to allow potential clusters to interactively tune themselves.<n>It is shown that the proposed MGCPL-guided Categorical Data Clustering approach is competent in exploring the nested distribution of multi-granular clusters.
arXiv Detail & Related papers (2026-01-23T06:33:08Z)
Hierarchical Clustering With Confidence [6.4793198569929356]
Agglomerative hierarchical clustering is highly sensitive to small perturbations in the data.<n>We show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures.
arXiv Detail & Related papers (2025-12-06T18:18:20Z)
funOCLUST: Clustering Functional Data with Outliers [1.0435741631709405]
Functional data presents unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers.<n>An extension of the OCLUST algorithm to the functional setting is proposed to address these issues.<n>The approach leverages the OCLUST framework, creating a robust method to cluster curves and trim outliers.
arXiv Detail & Related papers (2025-07-31T19:00:20Z)
CAS Condensed and Accelerated Silhouette: An Efficient Method for Determining the Optimal K in K-Means Clustering [0.0]
This paper presents strategies for selecting the optimal value of k in clustering.<n>It focuses on achieving a balance between clustering precision and computational efficiency in complex data environments.<n>The proposed approach achieves up to 99 percent faster execution times on high-dimensional datasets.
arXiv Detail & Related papers (2025-07-11T05:03:16Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN.<n>We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
Line Space Clustering (LSC): Feature-Based Clustering using K-medians and Dynamic Time Warping for Versatility [0.0]
Line Space Clustering (LSC) is a representation that transforms data points into lines in a newly defined feature space. LSC employs a combined distance metric that uses Euclidean and Dynamic Time Warping (DTW) distances, weighted by a parameter alpha experiments demonstrate the efficacy of LSC on synthetic and real-world datasets.
arXiv Detail & Related papers (2025-03-20T01:27:10Z)
Graph Probability Aggregation Clustering [5.377020739388736]
We propose a graph-based fuzzy clustering algorithm that unifies the global clustering objective function with a local clustering constraint. The entire GPAC framework is formulated as a multi-constrained optimization problem, which can be solved using the Lagrangian method. Experiments conducted on synthetic, real-world, and deep learning datasets demonstrate that GPAC not only exceeds existing state-of-the-art methods in clustering performance but also excels in computational efficiency.
arXiv Detail & Related papers (2025-02-27T09:11:32Z)
Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks. We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
Hierarchical Sparse Representation Clustering for High-Dimensional Data Streams [16.228652652243888]
We propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. The experimental results obtained on several benchmark datasets demonstrate the effectiveness and robustness of HSRC.
arXiv Detail & Related papers (2024-09-07T03:40:55Z)
Adaptive Self-supervised Robust Clustering for Unstructured Data with Unknown Cluster Number [12.926206811876174]
We introduce a novel self-supervised deep clustering approach tailored for unstructured data, termed Adaptive Self-supervised Robust Clustering (ASRC) ASRC adaptively learns the graph structure and edge weights to capture both local and global structural information. ASRC even outperforms methods that rely on prior knowledge of the number of clusters, highlighting its effectiveness in addressing the challenges of clustering unstructured data.
arXiv Detail & Related papers (2024-07-29T15:51:09Z)
Fuzzy K-Means Clustering without Cluster Centroids [21.256564324236333]
Fuzzy K-Means clustering is a critical technique in unsupervised data analysis. This paper proposes a novel Fuzzy textitK-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
arXiv Detail & Related papers (2024-04-07T12:25:03Z)
Deep Embedding Clustering Driven by Sample Stability [16.53706617383543]
We propose a deep embedding clustering algorithm driven by sample stability (DECS) Specifically, we start by constructing the initial feature space with an autoencoder and then learn the cluster-oriented embedding feature constrained by sample stability. The experimental results on five datasets illustrate that the proposed method achieves superior performance compared to state-of-the-art clustering approaches.
arXiv Detail & Related papers (2024-01-29T09:19:49Z)
Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees. In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets. It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
Enhancing cluster analysis via topological manifold learning [0.3823356975862006]
We show that inferring the topological structure of a dataset before clustering can considerably enhance cluster detection. We combine manifold learning method UMAP for inferring the topological structure with density-based clustering method DBSCAN.
arXiv Detail & Related papers (2022-07-01T15:53:39Z)
Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points. We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed. We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
Stable and consistent density-based clustering via multiparameter persistence [77.34726150561087]
We consider the degree-Rips construction from topological data analysis. We analyze its stability to perturbations of the input data using the correspondence-interleaving distance. We integrate these methods into a pipeline for density-based clustering, which we call Persistable.
arXiv Detail & Related papers (2020-05-18T19:45:04Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.