BETULA: Numerically Stable CF-Trees for BIRCH Clustering
- URL: http://arxiv.org/abs/2006.12881v1
- Date: Tue, 23 Jun 2020 10:20:36 GMT
- Title: BETULA: Numerically Stable CF-Trees for BIRCH Clustering
- Authors: Andreas Lang and Erich Schubert
- Abstract summary: BIRCH clustering is a widely known clustering approach that has influenced much subsequent research and commercial products.
We introduce a replacement cluster feature that avoids the numerical problem of the original, is not much more expensive to maintain, and makes many computations simpler and hence more efficient.
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BIRCH clustering is a widely known clustering approach that has
influenced much subsequent research and commercial products. The key
contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a
compressed representation of the input data. As new data arrives, the tree is
eventually rebuilt to increase the compression. Afterward, the leaves of the
tree are used for clustering. Because of the data compression, this method is
very scalable. The idea has been adopted, for example, for k-means, data
stream, and density-based clustering.
Clustering features used by BIRCH are simple summary statistics that can
easily be updated with new data: the number of points, the linear sums, and the
sum of squared values. Unfortunately, how the sum of squares is then used in
BIRCH is prone to catastrophic cancellation.
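To see why this is fragile, note that the variance of a node can be recovered from these statistics as SS/N - (LS/N)^2, the difference of two nearly equal large quantities. The following minimal sketch (illustrative only, not code from the paper; the data is an arbitrary choice) shows rounding error taking over:

```python
import numpy as np

# Data with a large mean and a small spread: the worst case for the
# sum-of-squares formula, since SS/N and (LS/N)^2 are almost equal.
rng = np.random.default_rng(0)
data = 1e8 + rng.normal(0.0, 1.0, size=100_000)

N = len(data)
LS = data.sum()          # linear sum, as stored in a BIRCH clustering feature
SS = (data ** 2).sum()   # sum of squared values, as stored alongside it

var_cf = SS / N - (LS / N) ** 2  # variance recovered from the CF statistics
var_ref = data.var()             # stable two-pass reference computation

print(f"CF-based variance:  {var_cf:.6f}")   # badly corrupted by rounding
print(f"reference variance: {var_ref:.6f}")  # close to the true value 1.0
```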
We introduce a replacement cluster feature that does not have this numerical
problem, is not much more expensive to maintain, and makes many computations
simpler and hence more efficient. These cluster features can also
easily be used in other work derived from BIRCH, such as algorithms for
streaming data. In the experiments, we demonstrate the numerical problem and
compare the performance of the original algorithm with that of the improved
cluster features.
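One standard remedy, in the spirit of the replacement cluster feature described above, is to store the count, the mean, and the sum of squared deviations from the mean, and to update them incrementally. Below is a minimal one-dimensional sketch assuming a Welford-style insertion and the Chan et al. merge rule; the class and method names are illustrative, not the authors' API:

```python
from dataclasses import dataclass

@dataclass
class StableCF:
    """Cluster feature storing count, mean, and sum of squared deviations
    (illustrative sketch; not the data structure from the paper)."""
    n: int = 0
    mean: float = 0.0
    ssd: float = 0.0  # sum of squared deviations from the mean

    def insert(self, x: float) -> None:
        # Welford-style update: no large aggregates, hence no cancellation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.ssd += delta * (x - self.mean)

    def merge(self, other: "StableCF") -> None:
        # Combine two summaries, e.g. when absorbing a CF-tree node.
        if other.n == 0:
            return
        n = self.n + other.n
        delta = other.mean - self.mean
        self.ssd += other.ssd + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n

    @property
    def variance(self) -> float:
        return self.ssd / self.n if self.n > 0 else 0.0

cf = StableCF()
for x in (1e8 + 0.5, 1e8 - 0.5, 1e8 + 1.5, 1e8 - 1.5):
    cf.insert(x)
print(cf.variance)  # 1.25, correct despite the values being near 1e8
```

Because the mean is stored directly, variance and distance computations never subtract two large sums, which is precisely where the sum-of-squares formulation loses precision.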
Related papers
- Accelerating k-Means Clustering with Cover Trees [0.30693357740321775]
We propose a new k-means algorithm based on the cover tree index that has relatively low overhead and performs well.
We obtain a hybrid algorithm that combines the benefits of tree aggregation and bounds-based filtering.
arXiv Detail & Related papers (2024-10-19T14:02:42Z)
- Linear time Evidence Accumulation Clustering with KMeans [0.0]
This work describes a trick that mimics the behavior of average-linkage clustering.
We found a way to efficiently compute the density of a partitioning, reducing the cost from quadratic to linear complexity.
The k-means results are comparable to the best state of the art in terms of NMI while keeping the computational cost low.
arXiv Detail & Related papers (2023-11-15T14:12:59Z)
- Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified into a single framework.
To conduct feedback actions, a clustering-oriented reward function is proposed to enhance the cohesion within clusters and the separation between clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Dink-Net: Neural Clustering on Large Graphs [59.10189693120368]
A deep graph clustering method (Dink-Net) is proposed with the idea of dilation and shrink.
By discriminating whether nodes have been corrupted by augmentations, representations are learned in a self-supervised manner.
The clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss.
Compared to the runner-up, Dink-Net achieves a 9.62% NMI improvement on the ogbn-papers100M dataset with 111 million nodes and 1.6 billion edges.
arXiv Detail & Related papers (2023-05-28T15:33:24Z)
- GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity Granular-Balls and the minimum spanning tree (MST).
We construct coarse-grained granular-balls and then use the granular-balls and MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z)
- Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle [9.578056676899203]
The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks.
We provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.
arXiv Detail & Related papers (2022-02-17T08:51:19Z)
- Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
- Clustering of Big Data with Mixed Features [3.3504365823045044]
We develop a new clustering algorithm for large data of mixed type.
The algorithm is capable of detecting outliers and clusters of relatively low density.
We present experimental results to verify that our algorithm works well in practice.
arXiv Detail & Related papers (2020-11-11T19:54:38Z)
- Spectral Clustering with Smooth Tiny Clusters [14.483043753721256]
We propose a novel clustering algorithm, which considers the smoothness of data for the first time.
Our key idea is to cluster tiny clusters, whose centers constitute smooth graphs.
Although this paper focuses solely on multi-scale situations, the idea of data smoothness can certainly be extended to any clustering algorithm.
arXiv Detail & Related papers (2020-09-10T05:21:20Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)