A framework for benchmarking clustering algorithms
- URL: http://arxiv.org/abs/2209.09493v3
- Date: Wed, 25 Oct 2023 22:32:18 GMT
- Title: A framework for benchmarking clustering algorithms
- Authors: Marek Gagolewski
- Abstract summary: Clustering algorithms can be tested on a variety of benchmark problems.
Many research papers and graduate theses consider only a small number of datasets.
We have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms.
- Score: 2.900810893770134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of clustering algorithms can involve running them on a variety
of benchmark problems, and comparing their outputs to the reference,
ground-truth groupings provided by experts. Unfortunately, many research papers
and graduate theses consider only a small number of datasets. Also, the fact
that there can be many equally valid ways to cluster a given problem set is
rarely taken into account. In order to overcome these limitations, we have
developed a framework whose aim is to introduce a consistent methodology for
testing clustering algorithms. Furthermore, we have aggregated, polished, and
standardised many clustering benchmark dataset collections referred to across
the machine learning and data mining literature, and included new datasets of
different dimensionalities, sizes, and cluster types. An interactive datasets
explorer, the documentation of the Python API, a description of the ways to
interact with the framework from other programming languages such as R or
MATLAB, and other details are all provided at
<https://clustering-benchmarks.gagolewski.com>.
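The benchmarking loop the abstract describes, running an algorithm on a dataset and comparing its output to a reference grouping via an external validity index, can be sketched as follows. The dataset and algorithm here are toy stand-ins; the framework's actual Python API is documented at the project site linked above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for a benchmark dataset with an expert-provided
# ground-truth labelling.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

# Run the algorithm under test and obtain a candidate partition.
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External validity index: 1.0 means perfect agreement with the
# reference grouping (up to relabelling of the clusters).
score = adjusted_rand_score(y_true, y_pred)
print(f"adjusted Rand index: {score:.3f}")
```

Since, as the abstract notes, there can be many equally valid ways to cluster a problem set, a benchmark in this spirit would score a partition against each available reference labelling and report, e.g., the best match.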
Related papers
- Dying Clusters Is All You Need -- Deep Clustering With an Unknown Number of Clusters [5.507296054825372]
Finding meaningful groups in high-dimensional data is an important challenge in data mining.
Deep clustering methods have achieved remarkable results in these tasks.
Most of these methods require the user to specify the number of clusters in advance.
This is a major limitation since the number of clusters is typically unknown if labeled data is unavailable.
Most of these approaches estimate the number of clusters separately from the clustering process.
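A common example of estimating the number of clusters separately from the clustering itself is to sweep over candidate values of k and keep the one maximising an internal index such as the silhouette. This is a generic sketch of that idea, not the paper's method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=1)

# Cluster for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The silhouette-maximising k is taken as the estimate.
best_k = max(scores, key=scores.get)
print("estimated number of clusters:", best_k)
```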
arXiv Detail & Related papers (2024-10-12T11:04:10Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
- High-Level Synthetic Data Generation with Data Set Archetypes [4.13592995550836]
Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms.
We propose synthetic data generation based on data set archetypes.
It is possible to set up benchmarks purely from verbal descriptions of the evaluation scenarios.
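The idea of setting up benchmarks from verbal descriptions can be illustrated by mapping scenario descriptions onto generator parameters. The archetype names and the parameter mapping below are illustrative assumptions, not the paper's actual specification.

```python
from sklearn.datasets import make_blobs

# Hypothetical archetypes: each verbal description of an evaluation
# scenario maps to concrete generator parameters.
ARCHETYPES = {
    "well-separated spherical clusters": dict(centers=3, cluster_std=0.5),
    "heavily overlapping clusters": dict(centers=3, cluster_std=3.0),
}

def generate(archetype, n_samples=300, random_state=0):
    params = ARCHETYPES[archetype]
    return make_blobs(n_samples=n_samples, random_state=random_state, **params)

X, y = generate("well-separated spherical clusters")
print(X.shape, y.shape)
```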
arXiv Detail & Related papers (2023-03-24T23:45:27Z)
- Generating Multidimensional Clusters With Support Lines [0.0]
We present Clugen, a modular procedure for synthetic data generation.
Clugen is open source, comprehensively unit tested and documented.
We demonstrate that Clugen is fit for use in the assessment of clustering algorithms.
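Clugen's core idea is to place cluster points around line segments ("support lines"). This NumPy sketch reproduces that idea for a single cluster; it is an illustration of the concept, not Clugen's actual algorithm or API.

```python
import numpy as np

rng = np.random.default_rng(42)

def line_cluster(center, direction, length, n, lateral_std):
    # Unit vector giving the support line's direction.
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    # Positions of the points projected along the support line...
    t = rng.uniform(-length / 2, length / 2, size=n)
    on_line = np.asarray(center, dtype=float) + t[:, None] * d
    # ...plus lateral Gaussian noise around the line.
    return on_line + rng.normal(0.0, lateral_std, size=on_line.shape)

pts = line_cluster(center=[0, 0], direction=[1, 1], length=10, n=200, lateral_std=0.3)
print(pts.shape)
```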
arXiv Detail & Related papers (2023-01-24T22:08:24Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Analysis of Sparse Subspace Clustering: Experiments and Random Projection [0.0]
Clustering is a technique that is used in many domains, such as face clustering, plant categorization, image segmentation, and document classification.
We analyze one of these techniques: a powerful clustering algorithm called Sparse Subspace Clustering.
We demonstrate several experiments using this method and then introduce a new approach that can reduce the computational time required to perform sparse subspace clustering.
arXiv Detail & Related papers (2022-04-01T23:55:53Z)
- Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data.
In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data.
Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
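The "cluster points the way a human reads a scatter plot" idea can be sketched by rasterising the points onto a 2-D grid and labelling connected occupied cells, which is the image-segmentation view. This is a simplified illustration of the concept, not the paper's Visual Clustering method.

```python
import numpy as np
from scipy.ndimage import label
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=3)

# Rasterise the scatter plot onto a coarse occupancy grid.
bins = 20
H, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
occupied = H > 0

# Connected components of occupied cells become the clusters,
# exactly as a segmentation algorithm would group image regions.
components, n_clusters = label(occupied)
print("clusters found:", n_clusters)
```

The grid resolution plays the role of image resolution: too fine and clusters fragment, too coarse and they merge.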
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
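The trimming idea can be sketched with the generic trimmed k-means scheme: fit k-means, discard the fraction of points farthest from their centres as outliers, and refit on the rest. Note that RTKM identifies outliers and clusters simultaneously; the two-stage version below is only an illustration of the concept.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Contaminate the data with a few uniform outliers.
X = np.vstack([X, np.random.default_rng(0).uniform(-30, 30, size=(15, 2))])

# First pass: ordinary k-means on everything.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Trim the 5% of points farthest from their assigned centre, then refit.
keep = dists <= np.quantile(dists, 0.95)
km_trimmed = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[keep])
print("points kept:", int(keep.sum()), "of", len(X))
```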
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
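For reference, the agglomerative primitive the paper scales up looks like this at small scale with SciPy. SciPy's implementation is quadratic in the number of points and only illustrates the operation, not the paper's scalable method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Bottom-up merge tree: each step joins the two closest clusters.
Z = linkage(X, method="ward")

# Cut the dendrogram into (at most) 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```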
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.