Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm
- URL: http://arxiv.org/abs/2209.05757v1
- Date: Tue, 13 Sep 2022 06:42:53 GMT
- Title: Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm
- Authors: Marek Gagolewski, Maciej Bartoszuk, Anna Cena
- Abstract summary: We propose a new hierarchical clustering linkage criterion called Genie.
Our algorithm links two clusters in such a way that a chosen economic inequity measure does not drastically increase above a given threshold.
A reference implementation of the algorithm has been included in the open source 'genie' package for R.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The time needed to apply a hierarchical clustering algorithm is most often
dominated by the number of computations of a pairwise dissimilarity measure.
For larger data sets, this constraint puts all the classical linkage criteria
except single linkage at a disadvantage. However, it is known
that the single linkage clustering algorithm is very sensitive to outliers,
produces highly skewed dendrograms, and therefore usually does not reflect the
true underlying data structure -- unless the clusters are well-separated. To
overcome its limitations, we propose a new hierarchical clustering linkage
criterion called Genie. Namely, our algorithm links two clusters in such a way
that a chosen economic inequity measure (e.g., the Gini- or Bonferroni-index)
of the cluster sizes does not drastically increase above a given threshold. The
presented benchmarks indicate a high practical usefulness of the introduced
method: it most often outperforms the Ward or average linkage in terms of the
clustering quality while retaining the single linkage's speed. The Genie
algorithm is easily parallelizable and thus may be run on multiple threads to
speed up its execution even further. Its memory overhead is small: there is no
need to precompute the complete distance matrix to perform the computations in
order to obtain a desired clustering. It can be applied on arbitrary spaces
equipped with a dissimilarity measure, e.g., on real vectors, DNA or protein
sequences, images, rankings, informetric data, etc. A reference implementation
of the algorithm has been included in the open source 'genie' package for R.
See also https://genieclust.gagolewski.com for a new implementation
(genieclust) -- available for both R and Python.
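The core idea above can be illustrated with a short, self-contained sketch: merge the closest pair of clusters (single linkage), but whenever the Gini index of the current cluster sizes exceeds the threshold, restrict the candidate merges to those involving a smallest cluster. This is a minimal illustration of the Genie constraint, not the reference implementation; the function names (`gini_index`, `genie_like_clustering`) and the O(n^2) merge loop are illustrative only, and real use should go through the `genie` or `genieclust` packages.

```python
import numpy as np

def gini_index(sizes):
    # Normalized Gini index of a vector of cluster sizes:
    # 0 for perfectly balanced sizes, 1 for maximal inequality.
    x = np.sort(np.asarray(sizes, dtype=float))[::-1]  # descending order
    n = x.size
    if n <= 1 or x.sum() == 0:
        return 0.0
    i = np.arange(1, n + 1)
    return float(((n - 2 * i + 1) * x).sum() / ((n - 1) * x.sum()))

def genie_like_clustering(X, n_clusters=2, gini_threshold=0.3):
    # Start from singletons; repeatedly merge the closest pair of clusters
    # under single linkage, except that when the Gini index of the cluster
    # sizes exceeds the threshold, only merges involving a cluster of the
    # currently smallest size are allowed (the Genie constraint).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        constrained = gini_index(sizes) > gini_threshold
        smallest = min(sizes)
        best, best_dist = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if constrained and sizes[a] != smallest and sizes[b] != smallest:
                    continue  # forbid merges that would worsen size inequality
                dist = d[np.ix_(clusters[a], clusters[b])].min()  # single linkage
                if dist < best_dist:
                    best, best_dist = (a, b), dist
        a, b = best
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(X), dtype=int)
    for k, c in enumerate(clusters):
        labels[c] = k
    return labels
```

With `gini_threshold=1.0` the constraint never triggers and the procedure reduces to plain single linkage; lower thresholds force the smallest clusters to be absorbed first, which is what gives Genie its resistance to outlier-induced chaining.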
Related papers
- Data Aggregation for Hierarchical Clustering [0.3626013617212666]
BETULA is a numerically stable version of the well known BIRCH data aggregation algorithm.
It can be used to make HAC viable on systems with constrained resources with only small losses in clustering quality.
arXiv Detail & Related papers (2023-09-05T19:39:43Z)
- Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified in a single framework.
In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- A Restarted Large-Scale Spectral Clustering with Self-Guiding and Block Diagonal Representation [1.115905690697198]
We propose a restarted clustering framework with self-guiding and block diagonal representation.
An advantage of the strategy is that some useful clustering information obtained from previous cycles could be preserved.
Theoretical results are established to show the rationality of inexact computations in spectral clustering.
arXiv Detail & Related papers (2023-06-27T01:38:52Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
- Scalable Clustering: Large Scale Unsupervised Learning of Gaussian Mixture Models with Outliers [5.478764356647437]
This paper introduces a provably robust clustering algorithm based on loss minimization.
It provides theoretical guarantees that the algorithm obtains high accuracy with high probability.
Experiments on real-world large-scale datasets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-02-28T14:39:18Z)
- Skew-Symmetric Adjacency Matrices for Clustering Directed Graphs [5.301300942803395]
Cut-based directed graph (digraph) clustering often focuses on finding dense within-cluster or sparse between-cluster connections.
For flow-based clusterings the edges between clusters tend to be oriented in one direction and have been found in migration data, food webs, and trade data.
arXiv Detail & Related papers (2022-03-02T20:07:04Z)
- Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z)
- ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain [6.824747267214373]
We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms.
Compared to most previous parallel HAC algorithms, our new algorithms require only linear memory, and are scalable to large data sets.
Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
arXiv Detail & Related papers (2021-06-08T23:13:27Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.