GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data
- URL: http://arxiv.org/abs/2509.23742v1
- Date: Sun, 28 Sep 2025 08:41:15 GMT
- Title: GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data
- Authors: Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang,
- Abstract summary: We propose a novel scalable skeleton clustering algorithm, namely GBSK, to handle clustering tasks for large-scale datasets. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton". In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To effectively handle clustering task for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton" -- a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.
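The abstract sketches the core idea: multi-sample the data, cover each sample with multi-grained granular balls, and use the ball centres as a compact "skeleton" of the original distribution. A minimal, illustrative sketch of that idea in plain Python follows; the radius-threshold splitting rule, function names, and parameter defaults here are assumptions for illustration, not the authors' implementation:

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a list of points."""
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_ball(points, max_radius, min_size=2):
    """Recursively split a ball via 2-means until its mean radius is small.
    Returns a list of (center, radius, size) granular balls."""
    center = mean(points)
    radius = sum(dist(p, center) for p in points) / len(points)
    if radius <= max_radius or len(points) < 2 * min_size:
        return [(center, radius, len(points))]
    seeds = random.sample(points, 2)          # 2-means split
    for _ in range(10):
        groups = ([], [])
        for p in points:
            groups[0 if dist(p, seeds[0]) <= dist(p, seeds[1]) else 1].append(p)
        if not groups[0] or not groups[1]:    # degenerate split: keep the ball
            return [(center, radius, len(points))]
        seeds = [mean(groups[0]), mean(groups[1])]
    balls = []
    for g in groups:
        balls.extend(split_ball(g, max_radius, min_size))
    return balls

def granular_ball_skeleton(data, sample_frac=0.3, n_samples=3, max_radius=0.5, seed=0):
    """Multi-sample the data, cover each sample with granular balls,
    and pool the ball centres as a coarse skeleton of the data."""
    random.seed(seed)
    centers = []
    for _ in range(n_samples):
        sample = random.sample(data, max(2, int(sample_frac * len(data))))
        centers.extend(c for c, _, _ in split_ball(sample, max_radius))
    return centers
```

Clustering the (much smaller) set of skeleton centres, rather than the raw points, is what makes this kind of approach scale; the skeleton centres can then be grouped with any standard clustering routine.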
Related papers
- Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning [47.32771052588132]
The nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data. We propose a Multi-Granular Competitive Learning algorithm that allows potential clusters to interactively tune themselves. It is shown that the proposed MGCPL-guided categorical data clustering approach is competent in exploring the nested distribution of multi-granular clusters.
arXiv Detail & Related papers (2026-01-23T06:33:08Z) - Persistent Multiscale Density-based Clustering [0.515435457943463]
We introduce Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). PLSCAN efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. We compare PLSCAN's performance to HDBSCAN* on several real-world datasets.
arXiv Detail & Related papers (2025-12-18T14:01:35Z) - Scalable Context-Preserving Model-Aware Deep Clustering for Hyperspectral Images [51.95768218975529]
Subspace clustering has become widely adopted for the unsupervised analysis of hyperspectral images (HSIs). Recent model-aware deep subspace clustering methods often use a two-stage framework, involving the calculation of a self-representation matrix with O(n^2) complexity, followed by spectral clustering. We propose a scalable, context-preserving deep clustering method based on basis representation, which jointly captures local and non-local structures for efficient HSI clustering.
arXiv Detail & Related papers (2025-06-12T16:43:09Z) - Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN clustering framework based on Multi-agent Reinforcement Learning, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but is also capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z) - CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training [63.07024608399447]
We propose an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. We introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset.
arXiv Detail & Related papers (2025-04-17T17:58:13Z) - CoHiRF: A Scalable and Interpretable Clustering Framework for High-Dimensional Data [0.30723404270319693]
We propose Consensus Hierarchical Random Feature (CoHiRF), a novel clustering method designed to address the challenges of high-dimensional data. CoHiRF leverages random feature selection to mitigate noise and dimensionality effects, repeatedly applies K-Means clustering in reduced feature spaces, and combines results through a unanimous consensus criterion. CoHiRF is computationally efficient with a running time comparable to K-Means, scalable to massive datasets, and exhibits robust performance against state-of-the-art methods such as SC-SRGF, HDBSCAN, and OPTICS.
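The pipeline described above (random feature subsets, K-Means per subset, unanimous consensus) can be sketched in a few dozen lines. This is a toy reconstruction of the stated recipe, not the authors' CoHiRF code; all function names and defaults below are assumptions:

```python
import math
import random

def kmeans_labels(data, k, iters=20, rng=None):
    """Plain Lloyd's k-means; returns a cluster label per point."""
    rng = rng or random
    centers = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iters):
        for i, p in enumerate(data):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        for c in range(k):
            members = [data[i] for i in range(len(data)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def consensus_random_feature_clustering(data, k=2, n_runs=5, feat_frac=0.5, seed=0):
    """Run k-means on random feature subsets and keep only unanimous
    co-assignments: two points share a final cluster iff they were
    co-clustered in every run."""
    rng = random.Random(seed)
    d = len(data[0])
    signatures = [tuple() for _ in data]
    for _ in range(n_runs):
        feats = rng.sample(range(d), max(1, int(feat_frac * d)))
        proj = [[p[j] for j in feats] for p in data]      # reduced feature space
        run_labels = kmeans_labels(proj, k, rng=rng)
        signatures = [sig + (lab,) for sig, lab in zip(signatures, run_labels)]
    # identical label sequence across all runs -> same consensus cluster
    ids = {}
    return [ids.setdefault(sig, len(ids)) for sig in signatures]
```

Using label signatures across runs makes the consensus insensitive to label permutation between runs, which is the usual pitfall when combining independent K-Means results.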
arXiv Detail & Related papers (2025-02-01T09:38:44Z) - ShapeSplat: A Large-scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining [112.40071212468843]
3D Gaussian Splatting (3DGS) has become the de facto method of 3D representation in many vision tasks. We build a large-scale dataset of 3DGS using the commonly used ShapeNet, ModelNet, and Objaverse datasets. We introduce Gaussian-MAE, which highlights the unique benefits of representation learning from Gaussian parameters.
arXiv Detail & Related papers (2024-08-20T14:49:14Z) - Toward Efficient and Incremental Spectral Clustering via Parametric Spectral Clustering [2.44755919161855]
Spectral clustering is a popular method for effectively clustering nonlinearly separable data.
This paper introduces a novel approach called parametric spectral clustering (PSC)
PSC addresses the challenges associated with big data and real-time scenarios.
arXiv Detail & Related papers (2023-11-14T01:26:20Z) - Variational Auto Encoder Gradient Clustering [0.0]
Clustering using deep neural network models has been extensively studied in recent years.
This article investigates how probability function gradient ascent can be used to process data in order to achieve better clustering.
We propose a simple yet effective method, based on the DBSCAN clustering algorithm, for investigating a suitable number of clusters for data.
arXiv Detail & Related papers (2021-05-11T08:00:36Z) - k-Factorization Subspace Clustering [12.18340575383456]
Subspace clustering aims to cluster data lying in a union of low-dimensional subspaces.
This paper presents a method called k-Factorization Subspace Clustering (k-FSC) for large-scale subspace clustering.
arXiv Detail & Related papers (2020-12-08T10:34:21Z) - Graph Convolutional Subspace Clustering: A Robust Subspace Clustering Framework for Hyperspectral Image [6.332208511335129]
We present a novel subspace clustering framework called Graph Convolutional Subspace Clustering (GCSC) for robust HSI clustering.
Specifically, the framework recasts the self-expressiveness property of the data into the non-Euclidean domain.
We show that traditional subspace clustering models are special forms of our framework with Euclidean data.
arXiv Detail & Related papers (2020-04-22T10:09:19Z) - Learnable Subspace Clustering [76.2352740039615]
We develop a learnable subspace clustering paradigm to efficiently solve the large-scale subspace clustering problem.
The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces.
To the best of our knowledge, this is the first subspace clustering work to efficiently cluster millions of data points.
arXiv Detail & Related papers (2020-04-09T12:53:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.