Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance
Distribution (EDD)
- URL: http://arxiv.org/abs/2311.16621v1
- Date: Tue, 28 Nov 2023 09:22:17 GMT
- Title: Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance
Distribution (EDD)
- Authors: Claus Metzner, Achim Schilling and Patrick Krauss
- Abstract summary: Entropy of Distance Distribution (EDD) is a paradigm shift in label-free clustering analysis.
Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set.
EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the evolving landscape of data science, the accurate quantification of
clustering in high-dimensional data sets remains a significant challenge,
especially in the absence of predefined labels. This paper introduces a novel
approach, the Entropy of Distance Distribution (EDD), which represents a
paradigm shift in label-free clustering analysis. Traditional methods, reliant
on discrete labels, often struggle to discern intricate cluster patterns in
unlabeled data. EDD, however, leverages the characteristic differences in
pairwise point-to-point distances to discern clustering tendencies, independent
of data labeling.
Our method employs the Shannon information entropy to quantify the
'peakedness' or 'flatness' of distance distributions in a data set. This
entropy measure, normalized against its maximum value, effectively
distinguishes between strongly clustered data (indicated by pronounced peaks in
distance distribution) and more homogeneous, non-clustered data sets. This
label-free quantification is resilient against global translations and
permutations of data points, and with an additional dimension-wise z-scoring,
it becomes invariant to data set scaling.
We demonstrate the efficacy of EDD through a series of experiments involving
two-dimensional data spaces with Gaussian cluster centers. Our findings reveal
a monotonic increase in the EDD value with the widening of cluster widths,
moving from well-separated to overlapping clusters. This behavior underscores
the method's sensitivity and accuracy in detecting varying degrees of
clustering. EDD's potential extends beyond conventional clustering analysis,
offering a robust, scalable tool for unraveling complex data structures without
reliance on pre-assigned labels.
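To make the description above concrete, here is a minimal sketch of how an EDD-style score could be computed from the abstract alone. The fixed-bin histogram estimate of the distance distribution (`n_bins=50`), the Euclidean metric, and the function name `edd` are illustrative assumptions, not the authors' reference implementation; only the z-scoring, the pairwise-distance representation, and the normalization by the maximum entropy are taken from the abstract.

```python
import numpy as np
from scipy.spatial.distance import pdist

def edd(points, n_bins=50):
    """Entropy of Distance Distribution (illustrative sketch).

    points : (n_samples, n_dims) array of observations.
    n_bins : histogram resolution (an assumption; the abstract does
             not fix a binning scheme).

    Returns the Shannon entropy of the pairwise-distance histogram,
    normalized by its maximum value log(n_bins): low for peaked
    (strongly clustered) distance distributions, high for flat
    (homogeneous) ones.
    """
    points = np.asarray(points, dtype=float)

    # Dimension-wise z-scoring makes the measure invariant to scaling.
    z = (points - points.mean(axis=0)) / points.std(axis=0)

    # All pairwise Euclidean distances; this representation is already
    # invariant to global translations and permutations of the points.
    d = pdist(z)

    # Empirical distance distribution as a normalized histogram.
    counts, _ = np.histogram(d, bins=n_bins)
    p = counts / counts.sum()

    # Shannon entropy over non-empty bins, normalized to [0, 1].
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)) / np.log(n_bins))
```

A toy version of the reported two-dimensional Gaussian experiment follows (the centers, widths, and sample sizes are made-up illustration values, not the paper's settings); consistent with the abstract, the EDD score should rise as the cluster width grows and the clusters begin to overlap:

```python
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])

for width in (0.2, 0.5, 1.0, 2.0, 4.0):
    # 200 points per Gaussian cluster at the given standard deviation.
    pts = np.vstack([c + width * rng.standard_normal((200, 2))
                     for c in centers])
    print(f"width = {width:.1f}  ->  EDD = {edd(pts):.3f}")
```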
Related papers
- Continuous Contrastive Learning for Long-Tailed Semi-Supervised Recognition [50.61991746981703] (2024-10-08)
Current state-of-the-art LTSSL approaches rely on high-quality pseudo-labels for large-scale unlabeled data.
This paper introduces a novel probabilistic framework that unifies various recent proposals in long-tail learning.
We introduce a continuous contrastive learning method, CCL, extending our framework to unlabeled data using reliable and smoothed pseudo-labels.
- Self-Supervised Graph Embedding Clustering [70.36328717683297] (2024-09-24)
The one-step K-means dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194] (2024-02-03)
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
- Sanitized Clustering against Confounding Bias [38.928080236294775] (2023-11-02)
This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB).
SCAB removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure.
Experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance.
- Rethinking k-means from manifold learning perspective [122.38667613245151] (2023-05-12)
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
- Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data [16.153709556346417] (2022-09-29)
Clustering is a widely deployed learning tool.
iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data.
- Intrinsic dimension estimation for discrete metrics [65.5438227932088] (2022-07-20)
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
- Enhancing cluster analysis via topological manifold learning [0.3823356975862006] (2022-07-01)
We show that inferring the topological structure of a dataset before clustering can considerably enhance cluster detection.
We combine the manifold learning method UMAP, for inferring the topological structure, with the density-based clustering method DBSCAN.
- How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces [0.0] (2022-01-13)
We investigate how the sensitivities of common Euclidean norm-based cluster indices scale with dimension for a variety of synthetic data schemes.
We find that the overwhelming majority of indices have improved or stable sensitivity in high dimensions.
- Tensor Laplacian Regularized Low-Rank Representation for Non-uniformly Distributed Data Subspace Clustering [2.578242050187029] (2021-03-06)
Low-Rank Representation (LRR) suffers from discarding the locality information of data points in subspace clustering.
We propose a hypergraph model that allows a variable number of adjacent nodes and incorporates the locality information of the data.
Experiments on artificial and real datasets demonstrate the higher accuracy and precision of the proposed method.
- Decorrelated Clustering with Data Selection Bias [55.91842043124102] (2020-06-29)
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
This list is automatically generated from the titles and abstracts of the papers on this site.