Data Aggregation for Hierarchical Clustering
- URL: http://arxiv.org/abs/2309.02552v1
- Date: Tue, 5 Sep 2023 19:39:43 GMT
- Title: Data Aggregation for Hierarchical Clustering
- Authors: Erich Schubert and Andreas Lang
- Abstract summary: BETULA is a numerically stable version of the well-known BIRCH data aggregation algorithm.
It can be used to make HAC viable on systems with constrained resources with only small losses in clustering quality.
- Score: 0.3626013617212666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most
flexible clustering method, because it can be used with many distances,
similarities, and various linkage strategies. It is often used when the number
of clusters the data set forms is unknown and some sort of hierarchy in the
data is plausible. Most algorithms for HAC operate on a full distance matrix,
and therefore require quadratic memory. The standard algorithm also has cubic
runtime to produce a full hierarchy. Both memory and runtime are especially
problematic in the context of embedded or otherwise very resource-constrained
systems. In this paper, we present how data aggregation with BETULA, a
numerically stable version of the well-known BIRCH data aggregation algorithm,
can be used to make HAC viable on systems with constrained resources with only
small losses in clustering quality, and hence allow exploratory data analysis
of very large data sets.
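The aggregate-then-cluster pipeline described in the abstract can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's classic BIRCH as a stand-in for BETULA (which scikit-learn does not implement) and runs SciPy's HAC on the aggregated subcluster centers; the dataset and all parameter values are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of "aggregate, then cluster": classic BIRCH (a stand-in
# for BETULA) compresses the data into subclusters, and HAC runs only on
# the subcluster centers instead of on all raw points.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20_000, centers=5, random_state=0)

# Step 1: aggregate 20k points into a modest number of subclusters.
# The threshold controls aggregation granularity (illustrative value).
birch = Birch(threshold=0.5, n_clusters=None).fit(X)
centers = birch.subcluster_centers_

# Step 2: HAC over the m aggregated centers needs O(m^2) memory, rather
# than the O(n^2) a full distance matrix over raw points would require.
Z = linkage(centers, method="ward")
center_labels = fcluster(Z, t=5, criterion="maxclust")

# Step 3: every raw point inherits the label of its nearest subcluster.
point_labels = center_labels[birch.predict(X)]
print(f"{len(centers)} subclusters -> {len(np.unique(point_labels))} clusters")
```

Note that this sketch clusters unweighted subcluster centers; BIRCH/BETULA cluster features also carry per-subcluster weights, and the paper's approach works on those weighted features, which helps keep the quality loss small.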
Related papers
- Hierarchical Clustering using Reversible Binary Cellular Automata for High-Dimensional Data [0.0]
In cellular automaton (CA) based clustering, if two objects belong to the same cycle, they are closely related and considered part of the same cluster.
This paper identifies the relationship between objects in two different cycles based on the median of all elements in each cycle so that they can be grouped in the next stage.
When verified on standard benchmark datasets with various performance metrics, our algorithm is on par with existing algorithms while having quadratic time complexity.
arXiv Detail & Related papers (2024-08-05T05:48:45Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
- GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity Granular-Balls and a minimum spanning tree (MST).
We construct coarse-grained granular-balls, and then use the granular-balls and the MST to implement a clustering method based on "large-scale priority" (see the MST-cut sketch after this list).
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z) - Scalable Clustering: Large Scale Unsupervised Learning of Gaussian
Mixture Models with Outliers [5.478764356647437]
This paper introduces a provably robust clustering algorithm based on loss minimization.
It provides theoretical guarantees that the algorithm obtains high accuracy with high probability.
Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-02-28T14:39:18Z) - Genie: A new, fast, and outlier-resistant hierarchical clustering
algorithm [3.7491936479803054]
We propose a new hierarchical clustering linkage criterion called Genie.
Our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini index of cluster sizes) does not increase drastically above a given threshold.
A reference implementation of the algorithm is included in the open-source 'genie' package for R (see the usage sketch after this list).
arXiv Detail & Related papers (2022-09-13T06:42:53Z) - Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show that RTKM performs competitively with other methods on single-membership data with outliers and on multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z) - Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets [0.0]
We propose Reciprocal Agglomerative Clustering (RAC), a distributed algorithm for HAC, that uses a novel strategy to efficiently merge clusters in parallel.
In extensive experiments, we show that RAC can recover the HAC hierarchy on billions of data points connected by trillions of edges in less than an hour.
arXiv Detail & Related papers (2021-05-25T04:14:21Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Similarity-based Distance for Categorical Clustering using Space
Structure [5.543220407902113]
We propose a novel distance metric, similarity-based distance (SBD), to measure the distance between objects in categorical data.
Our proposed distance (SBD) significantly outperforms existing algorithms such as k-modes and other SBC-type algorithms when used on categorical datasets.
arXiv Detail & Related papers (2020-11-19T15:18:26Z) - Learnable Subspace Clustering [76.2352740039615]
We develop a learnable subspace clustering paradigm to efficiently solve the large-scale subspace clustering problem.
The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces.
To the best of our knowledge, this paper is the first among subspace clustering methods to efficiently cluster millions of data points.
arXiv Detail & Related papers (2020-04-09T12:53:28Z)
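The GBMST entry above builds on a minimum-spanning-tree primitive that is easy to illustrate. The sketch below is the classic MST-cut formulation, in which deleting the k-1 heaviest tree edges yields the same flat clusters as single linkage; it is not GBMST's granular-ball variant, and the data and function name are illustrative assumptions.

```python
# Classic MST-cut clustering: build a minimum spanning tree over the
# pairwise distances, delete its k-1 heaviest edges, and read off the
# k connected components as clusters (equivalent to flat single linkage).
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cluster(X: np.ndarray, k: int) -> np.ndarray:
    dist = squareform(pdist(X))                # dense pairwise distances
    mst = minimum_spanning_tree(dist).tocoo()  # n-1 weighted tree edges
    keep = np.argsort(mst.data)[: len(mst.data) - (k - 1)]
    pruned = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    _, labels = connected_components(pruned, directed=False)
    return labels

X = np.random.default_rng(1).normal(size=(200, 2))
print(np.bincount(mst_cluster(X, k=3)))        # sizes of the 3 clusters
```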
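The Genie entry above also has a Python companion to the R 'genie' package, genieclust, which exposes a scikit-learn-style estimator. The snippet below is a hedged usage sketch under that assumption; the gini_threshold value is the package's documented default and is shown only for illustration.

```python
# Hedged usage sketch for the Genie linkage criterion via the Python
# `genieclust` package (assumed API; the entry above names the R package).
import numpy as np
import genieclust

X = np.random.default_rng(2).normal(size=(500, 2))
# gini_threshold caps the inequality of cluster sizes during merging,
# which is what makes Genie resistant to outliers and runaway clusters.
labels = genieclust.Genie(n_clusters=3, gini_threshold=0.3).fit_predict(X)
print(np.bincount(labels))  # sizes of the 3 clusters
```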