HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis
- URL: http://arxiv.org/abs/2102.06940v1
- Date: Sat, 13 Feb 2021 15:01:34 GMT
- Title: HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis
- Authors: Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, and
John Keane
- Abstract summary: Comprehensive benchmarking of clustering algorithms is difficult.
There is no consensus regarding the best practice for rigorous benchmarking.
We demonstrate the important role evolutionary algorithms play to support flexible generation of such benchmarks.
- Score: 2.5329716878122404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comprehensive benchmarking of clustering algorithms is rendered difficult by
two key factors: (i) the elusiveness of a unique mathematical definition of
this unsupervised learning approach and (ii) dependencies between the
generating models or clustering criteria adopted by some clustering algorithms
and indices for internal cluster validation. Consequently, there is no
consensus regarding the best practice for rigorous benchmarking, and whether
this is possible at all outside the context of a given application. Here, we
argue that synthetic datasets must continue to play an important role in the
evaluation of clustering algorithms, but that this necessitates constructing
benchmarks that appropriately cover the diverse set of properties that impact
clustering algorithm performance. Through our framework, HAWKS, we demonstrate
the important role evolutionary algorithms play to support flexible generation
of such benchmarks, allowing simple modification and extension. We illustrate
two possible uses of our framework: (i) the evolution of benchmark data
consistent with a set of hand-derived properties and (ii) the generation of
datasets that tease out performance differences between a given pair of
algorithms. Our work has implications for the design of clustering benchmarks
that sufficiently challenge a broad range of algorithms, and for furthering
insight into the strengths and weaknesses of specific approaches.
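The first use case in the abstract, evolving data toward a chosen property, can be illustrated with a minimal sketch. This is not the HAWKS implementation: the choice of a target silhouette width as the property, the isotropic Gaussian genome, the (1+1) elitist loop, and every function name below are assumptions made for illustration only.

```python
import math
import random


def sample_dataset(genome, n_per_cluster, seed):
    """Draw 2-D points from one isotropic Gaussian per (cx, cy, sigma) gene."""
    rng = random.Random(seed)  # fixed seed: fitness then depends on the genome only
    points, labels = [], []
    for label, (cx, cy, sigma) in enumerate(genome):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))
            labels.append(label)
    return points, labels


def silhouette_width(points, labels):
    """Mean silhouette width of the labelled points (between -1 and 1)."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, leaving p out of its own cluster.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            mean_dist[c] = sum(dists) / len(dists)
        a = mean_dist[labels[i]]
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


def fitness(genome, target, n_per_cluster=15, seed=0):
    """Distance between the dataset's silhouette width and the target (lower is better)."""
    points, labels = sample_dataset(genome, n_per_cluster, seed)
    return abs(silhouette_width(points, labels) - target)


def mutate(genome, rng, step=0.3):
    """Perturb every cluster centre and spread with Gaussian noise."""
    return [(cx + rng.gauss(0, step),
             cy + rng.gauss(0, step),
             max(0.05, sigma + rng.gauss(0, step / 3)))  # keep sigma positive
            for cx, cy, sigma in genome]


def evolve(target, n_clusters=3, generations=200, seed=1):
    """(1+1) elitist loop: keep the child only if it is at least as fit as the parent."""
    rng = random.Random(seed)
    parent = [(rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0)
              for _ in range(n_clusters)]
    parent_fit = fitness(parent, target)
    initial_fit = parent_fit
    for _ in range(generations):
        child = mutate(parent, rng)
        child_fit = fitness(child, target)
        if child_fit <= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, initial_fit, parent_fit


if __name__ == "__main__":
    genome, initial_fit, final_fit = evolve(target=0.8)
    print("initial distance to target:", initial_fit)
    print("final distance to target:", final_fit)
```

A lower target silhouette width asks for more overlapping, harder clusters, so in this sketch the target acts as a rough difficulty dial. Swapping the fitness for the performance gap between two clustering algorithms would correspond to the second use case.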
Related papers
- From A-to-Z Review of Clustering Validation Indices [4.08908337437878]
We review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms.
We suggest a classification framework for examining the functionality of both internal and external clustering validation measures.
arXiv Detail & Related papers (2024-07-18T13:52:02Z)
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning to generate more reliable cluster assignment.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
- Learning the Precise Feature for Cluster Assignment [39.320210567860485]
We propose a framework which integrates representation learning and clustering into a single pipeline for the first time.
The proposed framework exploits the powerful ability of recently developed generative models for learning intrinsic features.
Experimental results show that the performance of the proposed method is superior, or at least comparable to, the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-11T04:08:54Z)
- Performance evaluation results of evolutionary clustering algorithm star for clustering heterogeneous datasets [15.154538450706474]
This article presents the data used to evaluate the performance of evolutionary clustering algorithm star (ECA*).
Two experimental methods are employed to examine the performance of ECA* against five traditional and modern clustering algorithms.
arXiv Detail & Related papers (2021-04-30T08:17:19Z)
- Fairness, Semi-Supervised Learning, and More: A General Framework for Clustering with Stochastic Pairwise Constraints [32.19047459493177]
We introduce a novel family of stochastic pairwise constraints, which we incorporate into several essential clustering objectives.
We show that these constraints can succinctly model an intriguing collection of applications, including Individual Fairness in clustering and Must-link constraints in semi-supervised learning.
arXiv Detail & Related papers (2021-03-02T20:27:58Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Simple and Scalable Sparse k-means Clustering via Feature Ranking [14.839931533868176]
We propose a novel framework for sparse k-means clustering that is intuitive, simple to implement, and competitive with state-of-the-art algorithms.
Our core method readily generalizes to several task-specific algorithms such as clustering on subsets of attributes and in partially observed data settings.
arXiv Detail & Related papers (2020-02-20T02:41:02Z)