A Pragmatic Method for Comparing Clusterings with Overlaps and Outliers
- URL: http://arxiv.org/abs/2602.14855v1
- Date: Mon, 16 Feb 2026 15:51:09 GMT
- Title: A Pragmatic Method for Comparing Clusterings with Overlaps and Outliers
- Authors: Ryan DeWolfe, Paweł Prałat, François Théberge,
- Abstract summary: In a general setting, the detected and ground truth clusterings may have outliers (objects belonging to no cluster), overlapping clusters (objects may belong to more than one cluster), or both.<n>In this note, we define a pragmatic similarity measure for comparing clusterings with overlaps and outliers, show that it has several desirable properties, and experimentally confirm that it is not subject to several common biases afflicting other clustering comparison measures.
- Score: 0.7646713951724009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering algorithms are an essential part of the unsupervised data science ecosystem, and extrinsic evaluation of clustering algorithms requires a method for comparing the detected clustering to a ground truth clustering. In a general setting, the detected and ground truth clusterings may have outliers (objects belonging to no cluster), overlapping clusters (objects may belong to more than one cluster), or both, but methods for comparing these clusterings are currently undeveloped. In this note, we define a pragmatic similarity measure for comparing clusterings with overlaps and outliers, show that it has several desirable properties, and experimentally confirm that it is not subject to several common biases afflicting other clustering comparison measures.
Related papers
- Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering [51.11677202873771]
Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets.<n>Unlike the Euclidean distance of numerical attributes, the categorical attributes lack well-defined relationships of their possible values.<n>This paper breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly revealing various cluster distributions.
arXiv Detail & Related papers (2025-11-12T06:57:24Z) - A Computational Theory and Semi-Supervised Algorithm for Clustering [0.0]
Clustering is the obtainment of groupings of data with no anomalies.<n>The kernel of the clustering method is the perception anomaly detection algorithm.<n>A semi-supervised clustering algorithm is presented.
arXiv Detail & Related papers (2023-06-12T09:15:58Z) - Cluster-level Group Representativity Fairness in $k$-means Clustering [3.420467786581458]
Clustering algorithms could generate clusters such that different groups are disadvantaged within different clusters.
We develop a clustering algorithm, building upon the centroid clustering paradigm pioneered by classical algorithms.
We show that our method is effective in enhancing cluster-level group representativity fairness significantly at low impact on cluster coherence.
arXiv Detail & Related papers (2022-12-29T22:02:28Z) - Seeking the Truth Beyond the Data. An Unsupervised Machine Learning
Approach [0.0]
Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together.
This article provides a deep description of the most widely used clustering methodologies.
It emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets.
arXiv Detail & Related papers (2022-07-14T14:22:36Z) - Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly
Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using a patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z) - You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z) - Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework, which is then applied to the clustering task and we come up with the Graph Constrastive Clustering(GCC) method.
Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z) - Clustering Ensemble Meets Low-rank Tensor Approximation [50.21581880045667]
This paper explores the problem of clustering ensemble, which aims to combine multiple base clusterings to produce better performance than that of the individual one.
We propose a novel low-rank tensor approximation-based method to solve the problem from a global perspective.
Experimental results over 7 benchmark data sets show that the proposed model achieves a breakthrough in clustering performance, compared with 12 state-of-the-art methods.
arXiv Detail & Related papers (2020-12-16T13:01:37Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Point-Set Kernel Clustering [11.093960688450602]
This paper introduces a new similarity measure called point-set kernel which computes the similarity between an object and a set of objects.
We show that the new clustering procedure is both effective and efficient that enables it to deal with large scale datasets.
arXiv Detail & Related papers (2020-02-14T00:00:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.