How I learned to stop worrying and love the curse of dimensionality: an
appraisal of cluster validation in high-dimensional spaces
- URL: http://arxiv.org/abs/2201.05214v1
- Date: Thu, 13 Jan 2022 21:17:10 GMT
- Authors: Brian A. Powell
- Abstract summary: We investigate how the sensitivities of common Euclidean norm-based cluster validity indices scale with dimension for a variety of synthetic data schemes.
We find that the overwhelming majority of indices have improved or stable sensitivity in high dimensions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The failure of the Euclidean norm to reliably distinguish between nearby and
distant points in high dimensional space is well-known. This phenomenon of
distance concentration manifests in a variety of data distributions, with iid
or correlated features, including centrally-distributed and clustered data.
Unsupervised learning based on Euclidean nearest neighbors, and more general
proximity-oriented data mining tasks like clustering, might therefore be
adversely affected by distance concentration in high-dimensional applications.
While considerable work has been done developing clustering algorithms with
reliable high-dimensional performance, the problem of cluster validation--of
determining the natural number of clusters in a dataset--has not been carefully
examined in high-dimensional problems. In this work we investigate how the
sensitivities of common Euclidean norm-based cluster validity indices scale
with dimension for a variety of synthetic data schemes, including
well-separated and noisy clusters, and find that the overwhelming majority of
indices have improved or stable sensitivity in high dimensions. The curse of
dimensionality is therefore dispelled for this class of fairly generic data
schemes.
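The distance concentration described in the abstract is easy to observe numerically. The following is an illustrative NumPy sketch (our own, not code from the paper): it measures the relative contrast (d_max - d_min) / d_min of Euclidean distances from the origin to iid uniform points as dimension grows.

```python
# Illustrative sketch (not from the paper): the relative contrast of Euclidean
# distances shrinks with dimension for iid uniform data -- the basic signature
# of distance concentration.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=1000):
    """(d_max - d_min) / d_min for distances from the origin to iid uniform points."""
    points = rng.uniform(-1.0, 1.0, size=(n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

In low dimension the nearest and farthest points typically differ by a large factor, while in high dimension the contrast collapses toward a small fraction, so "near" and "far" become hard to distinguish.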
Related papers
- SHADE: Deep Density-based Clustering [13.629470968274]
SHADE is the first deep clustering algorithm that incorporates density-connectivity into its loss function.
It supports high-dimensional and large data sets with the expressive power of a deep autoencoder.
It outperforms existing methods in clustering quality, especially on data that contain non-Gaussian clusters.
arXiv Detail & Related papers (2024-10-08T18:03:35Z)
- Deep Clustering Evaluation: How to Validate Internal Clustering Validation Measures [2.2252684361733284]
Deep clustering is a method for partitioning complex, high-dimensional data using deep neural networks.
Traditional clustering validation measures, designed for low-dimensional spaces, are problematic for deep clustering.
This paper addresses these challenges in evaluating clustering quality in deep learning.
arXiv Detail & Related papers (2024-03-21T20:43:44Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z)
- Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD) [0.0]
Entropy of Distance Distribution (EDD) is a paradigm shift in label-free clustering analysis.
Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set.
EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures.
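Reading the EDD summary above literally, one can histogram the pairwise Euclidean distances and take the Shannon entropy of that histogram; a minimal NumPy sketch of that idea (our own illustration with hypothetical names, not the authors' code):

```python
# Minimal illustration of the EDD idea as stated in the summary: Shannon entropy
# of the pairwise-distance histogram. Clustered data tends to give a "peaked"
# (concentrated, multi-modal) distance distribution, hence lower entropy.
import numpy as np

def distance_entropy(X, bins=50):
    """Shannon entropy (nats) of the histogram of pairwise Euclidean distances."""
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(len(X), k=1)]
    counts, _ = np.histogram(d, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(1)
blob = rng.normal(0.0, 0.05, size=(50, 2))
clustered = np.vstack([blob, blob + 10.0])       # two tight, well-separated clusters
unstructured = rng.uniform(0.0, 1.0, size=(100, 2))
print(distance_entropy(clustered), distance_entropy(unstructured))
```

Under this sketch the clustered data yields a markedly lower entropy than the unstructured data, matching the "peakedness versus flatness" intuition.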
arXiv Detail & Related papers (2023-11-28T09:22:17Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
- Density-Based Clustering with Kernel Diffusion [59.4179549482505]
A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in density-based clustering algorithms.
We propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness.
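For context, the "naive" ball-indicator density that this summary contrasts against can be sketched in a few lines (our own illustration under hypothetical names, not the paper's proposed diffusion kernel):

```python
# Illustration of the naive density mentioned above: the estimate at a query
# point is the fraction of data points inside a fixed-radius Euclidean ball,
# i.e. an indicator-function kernel. (Not the paper's kernel diffusion density.)
import numpy as np

def ball_indicator_density(X, query, radius=1.0):
    """Fraction of rows of X within `radius` of `query` (unnormalized density)."""
    return float(np.mean(np.linalg.norm(X - query, axis=1) <= radius))

X = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
print(ball_indicator_density(X, np.array([0.0, 0.0])))  # 2 of 3 points in the ball
```

The fixed radius is exactly what makes this estimator blind to varying local scale, which is the limitation an adaptive kernel aims to address.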
arXiv Detail & Related papers (2021-10-11T09:00:33Z)
- Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of Heart Failure Patients [50.48904066814385]
In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure.
We find clinically relevant clusters from an embedded space derived from heterogeneous data.
The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes.
arXiv Detail & Related papers (2020-12-24T12:56:46Z)
- On clustering uncertain and structured data with Wasserstein barycenters and a geodesic criterion for the number of clusters [0.0]
This work considers the notion of Wasserstein barycenters, accompanied by appropriate clustering indices based on the intrinsic geometry of the Wasserstein space where the clustering task is performed.
Clustering approaches of this type are particularly valuable in fields where the observational/experimental error is significant.
Under this perspective, each observation is identified by an appropriate probability measure and the proposed clustering schemes rely on discrimination criteria.
arXiv Detail & Related papers (2019-12-26T08:46:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.