Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures
- URL: http://arxiv.org/abs/2509.08530v1
- Date: Wed, 10 Sep 2025 12:18:52 GMT
- Title: Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures
- Authors: Wen-Bo Xie, Xun Fu, Bin Chen, Yan-Li Lee, Tao Deng, Tian Zou, Xin Wang, Zhen Liu, Jaideep Srivastava
- Abstract summary: We propose a graph-based active clustering algorithm that utilizes two sparse graphs. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints.
- Score: 14.417696261026492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce computational costs for iterative clustering updates; (2) to enhance the impact of user-provided constraints to minimize annotation requirements for precise clustering; and (3) to cut down memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing relationships between data (our proposed data skeleton) and another for updating this data skeleton. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints, and outperforms its counterparts in terms of computational performance and scalability, while maintaining robustness across various distance metrics.
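The two-graph design is specific to the paper, but the core loop it builds on — querying pairwise constraints only along a sparse graph and merging connected subgraphs into clusters — can be sketched generically. The following is a minimal illustration under assumed interfaces (the `oracle` callable and the edge ordering are hypothetical), not the authors' algorithm:

```python
# Minimal sketch of pairwise-constraint active clustering on a sparse graph.
# NOT the paper's data-skeleton algorithm: the edge ordering and the oracle
# interface are illustrative assumptions.

class UnionFind:
    """Tracks connected subgraphs (clusters) under must-link merges."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def active_cluster(n, candidate_edges, oracle):
    """candidate_edges: sparse list of (i, j) pairs, e.g. a k-NN graph
    sorted by distance; oracle(i, j) -> True for must-link, False for
    cannot-link (simulating user-provided pairwise constraints)."""
    uf = UnionFind(n)
    cannot = set()  # cannot-links between component roots at query time
    queries = 0
    for i, j in candidate_edges:
        ri, rj = uf.find(i), uf.find(j)
        if ri == rj or (ri, rj) in cannot or (rj, ri) in cannot:
            continue  # decided transitively; no user query needed
        queries += 1
        if oracle(i, j):
            uf.union(i, j)          # must-link: merge connected subgraphs
        else:
            cannot.add((ri, rj))    # cannot-link between current components
    return [uf.find(x) for x in range(n)], queries
```

Querying only sparse-graph edges, and skipping pairs already decided transitively, is what keeps the number of user-provided constraints small; a full implementation would also propagate cannot-links as components merge.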
Related papers
- Efficient Identification of High Similarity Clusters in Polygon Datasets [0.0]
We propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization. Our approach achieves substantial reductions in computational cost without sacrificing accuracy.
arXiv Detail & Related papers (2025-09-28T15:39:15Z) - A Clustering Method with Graph Maximum Decoding Information [6.11503045313947]
We present a novel clustering method for maximizing decoding information within graph-based models, named CMDI. CMDI incorporates two-dimensional structural information theory into the clustering process, consisting of two phases: graph structure extraction and graph partitioning. Empirical evaluations on three real-world datasets demonstrate that CMDI outperforms classical baseline methods, exhibiting a superior decoding information ratio (DI-R). These findings underscore the effectiveness of CMDI in enhancing decoding information quality and computational efficiency, positioning it as a valuable tool in graph-based clustering analyses.
arXiv Detail & Related papers (2024-03-18T05:18:19Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
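The paper's 3-approximation combines the k-center objective with an uncertainty-sampling term. As background, the classical greedy farthest-first heuristic for plain (unweighted) k-center, which that construction extends, looks like this — a sketch only, omitting the weighted combination:

```python
import math

def greedy_k_center(points, k):
    """Gonzalez's farthest-first traversal: a classical 2-approximation
    for unweighted k-center. It repeatedly picks the point farthest from
    the current centers. The weighted-sum-with-uncertainty variant from
    the paper is not shown here."""
    centers = [0]  # start from an arbitrary point
    dist = [math.dist(p, points[0]) for p in points]  # distance to nearest center
    while len(centers) < k:
        far = max(range(len(points)), key=lambda i: dist[i])
        centers.append(far)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[far]))
    return centers
```

Each iteration costs O(n) distance updates, so the whole traversal is O(nk) — one reason k-center-style objectives suit subset selection on large training sets.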
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Redundancy-Free Self-Supervised Relational Learning for Graph Clustering [13.176413653235311]
We propose a novel self-supervised deep graph clustering method named Redundancy-Free Graph Clustering (R$2$FGC)
It extracts the attribute- and structure-level relational information from both global and local views based on an autoencoder and a graph autoencoder.
Our experiments are performed on widely used benchmark datasets to validate the superiority of our R$2$FGC over state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-09T06:18:50Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Simple Contrastive Graph Clustering [41.396185271303956]
We propose a Simple Contrastive Graph Clustering (SCGC) algorithm to improve the existing methods.
Our algorithm outperforms the recent contrastive deep clustering competitors with at least seven times speedup on average.
arXiv Detail & Related papers (2022-05-11T06:45:19Z) - Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL)
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z) - Effective and Efficient Graph Learning for Multi-view Clustering [173.8313827799077]
We propose an effective and efficient graph learning model for multi-view clustering.
Our method exploits the view-similar between graphs of different views by the minimization of tensor Schatten p-norm.
Our proposed algorithm is time-economical and obtains the stable results and scales well with the data size.
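The tensor Schatten p-norm minimized above is built from the matrix Schatten p-norm, i.e. the l_p norm of a matrix's singular values. A minimal numpy sketch of the matrix version (the tensor extension applies this over slices of a rotated tensor, which is not shown):

```python
import numpy as np

def schatten_p_norm(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M.
    p=1 gives the nuclear norm, a common convex surrogate for rank;
    minimizing it encourages low-rank, view-consistent structure
    across the similarity graphs of different views."""
    s = np.linalg.svd(M, compute_uv=False)  # singular values only
    return float(np.sum(s ** p) ** (1.0 / p))
```

As a sanity check, p=2 recovers the Frobenius norm, and for the 3x3 identity the nuclear norm (p=1) is 3.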
arXiv Detail & Related papers (2021-08-15T13:14:28Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference [86.03382625531951]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference. It integrates automated data slimming, which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images' spatial complexity. Experiments and ablation studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Learning to Cluster Faces via Confidence and Connectivity Estimation [136.5291151775236]
We propose a fully learnable clustering framework without requiring a large number of overlapped subgraphs.
Our method significantly improves clustering accuracy and thus performance of the recognition models trained on top, yet it is an order of magnitude more efficient than existing supervised methods.
arXiv Detail & Related papers (2020-04-01T13:39:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.