T-SNE Is Not Optimized to Reveal Clusters in Data
- URL: http://arxiv.org/abs/2110.02573v1
- Date: Wed, 6 Oct 2021 08:35:39 GMT
- Title: T-SNE Is Not Optimized to Reveal Clusters in Data
- Authors: Zhirong Yang, Yuwei Chen, Jukka Corander
- Abstract summary: Cluster visualization is an essential task for nonlinear dimensionality reduction as a data analysis tool.
It is often believed that Student t-Distributed Stochastic Neighbor Embedding (t-SNE) can show clusters for well clusterable data.
We show that t-SNE may leave clustering patterns hidden despite strong signals present in the data.
- Score: 4.03823460330412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cluster visualization is an essential task for nonlinear dimensionality
reduction as a data analysis tool. It is often believed that Student
t-Distributed Stochastic Neighbor Embedding (t-SNE) can show clusters for well
clusterable data, with a smaller Kullback-Leibler divergence corresponding to a
better quality. A theoretical guarantee of this property has even been proven.
However, we point out that this is not necessarily the case -- t-SNE
may leave clustering patterns hidden despite strong signals present in the
data. Extensive empirical evidence is provided to support our claim. First,
several real-world counter-examples are presented, where t-SNE fails even if
the input neighborhoods are well clusterable. Tuning hyperparameters in t-SNE
or using better optimization algorithms does not help solve this issue because
a better t-SNE learning objective can correspond to a worse cluster embedding.
Second, we check the assumptions in the clustering guarantee of t-SNE and find
they are often violated for real-world data sets.
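As a concrete illustration, here is a minimal sketch (assuming scikit-learn; not the authors' code) that runs t-SNE from several random initializations and reports both the attained objective (the Kullback-Leibler divergence) and a cluster-quality score for the resulting embedding. If a smaller KL divergence guaranteed a better cluster visualization, the two rankings would always agree; the paper's point is that they need not.

```python
# Compare the t-SNE objective (KL divergence) with cluster quality (ARI)
# across random restarts. A lower KL value does not have to coincide
# with a better-separated embedding.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)

for seed in range(5):
    tsne = TSNE(n_components=2, init="random", random_state=seed)
    Z = tsne.fit_transform(X)
    # Judge cluster quality of the 2-D embedding against the true labels.
    pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
    ari = adjusted_rand_score(y, pred)
    print(f"seed={seed}  KL={tsne.kl_divergence_:.3f}  ARI={ari:.3f}")
```

Here `kl_divergence_` is the final objective value stored by scikit-learn, and the ARI measures how well k-means on the embedding recovers the ground-truth classes.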
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
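For reference, below is the conventional two-stage embed-then-cluster baseline, sketched under the assumption of scikit-learn; the paper's unified framework instead optimizes the embedding and the clustering jointly, which this baseline does not.

```python
# Two-stage baseline: manifold embedding first, K-means afterwards.
# One-step/joint methods such as the paper's framework aim to improve
# on exactly this decoupled pipeline.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding

X, y = load_digits(return_X_y=True)
# Graph-based manifold embedding reduces the dimensionality first ...
Z = SpectralEmbedding(n_components=10, n_neighbors=15).fit_transform(X)
# ... then K-means clusters in the embedded space.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
```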
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Multi-View Clustering via Semi-non-negative Tensor Factorization [120.87318230985653]
We develop a novel multi-view clustering method based on semi-non-negative tensor factorization (Semi-NTF).
Our model directly considers the between-view relationship and exploits the between-view complementary information.
In addition, we provide an optimization algorithm for the proposed method and prove mathematically that the algorithm always converges to the stationary KKT point.
arXiv Detail & Related papers (2023-03-29T14:54:19Z)
- Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors [6.918364447822299]
Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding.
We show that ct-SNE fails in many realistic settings.
We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities.
arXiv Detail & Related papers (2023-02-07T14:37:44Z)
- Cluster-guided Contrastive Graph Clustering Network [53.16233290797777]
We propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC).
We construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks.
To construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples.
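A hedged numpy sketch of that negative-sampling idea follows (illustrative names and threshold, not CCGC's actual implementation): each high-confidence sample is paired with the centers of the clusters it does not belong to.

```python
import numpy as np

def center_negatives(z, labels, confidence, tau=0.9):
    """Pair each high-confidence sample with the centers of the other
    clusters (centers computed from high-confidence members only).
    Assumes every cluster keeps at least one high-confidence member."""
    clusters = np.unique(labels)
    keep = confidence >= tau
    centers = {c: z[keep & (labels == c)].mean(axis=0) for c in clusters}
    return [(z[i], centers[c])
            for i in np.where(keep)[0]
            for c in clusters
            if c != labels[i]]
```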
arXiv Detail & Related papers (2023-01-03T13:42:38Z)
- Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z)
- Stochastic Cluster Embedding [14.485496311015398]
Neighbor Embedding (NE) aims to preserve pairwise similarities between data items.
NE methods such as Stochastic Neighbor Embedding (SNE) may leave large-scale patterns such as clusters hidden.
We propose a new cluster visualization method based on Neighbor Embedding.
arXiv Detail & Related papers (2021-08-18T07:07:28Z)
- Distribution free optimality intervals for clustering [1.7513645771137178]
Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data?
This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations.
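For concreteness, with the K-means distortion as the loss, "good" means a small value of the standard objective (a worked statement of the well-known formula, not notation taken from the paper):

$$\mathrm{dist}(\mathcal{C}) \;=\; \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x,$$

and "stable" means that any other clustering with comparably small distortion differs from $\mathcal{C}$ only by small perturbations.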
arXiv Detail & Related papers (2021-07-30T06:13:56Z)
- Improving ClusterGAN Using Self-Augmented Information Maximization of Disentangling Latent Spaces [8.88634093297796]
We propose the self-augmented information maximization improved ClusterGAN (SIMI-ClusterGAN) to learn distinctive priors directly from the data.
The proposed method has been validated on seven benchmark data sets and shows improved performance over state-of-the-art methods.
arXiv Detail & Related papers (2021-07-27T10:04:32Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$.
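Sketched below is a generic member of this family, assuming scikit-learn: project onto leading principal components, then run k-means. It illustrates the PCA-based approach under analysis, not the paper's exact algorithm or its minimax guarantees.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic two-cluster model with a sparse mean vector theta: only the
# first 10 of 500 coordinates carry signal (illustrative numbers).
n, p = 200, 500
theta = np.zeros(p)
theta[:10] = 3.0
signs = rng.choice([-1.0, 1.0], size=n)       # hidden cluster labels
X = np.outer(signs, theta) + rng.normal(size=(n, p))

# PCA-based clustering: reduce to the leading component, then k-means.
Z = PCA(n_components=1).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```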
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
- Robust Self-Supervised Convolutional Neural Network for Subspace Clustering and Classification [0.10152838128195464]
This paper proposes a robust formulation of the self-supervised convolutional subspace clustering network ($S^2$ConvSCN).
In a truly unsupervised training environment, Robust $S^2$ConvSCN outperforms its baseline version by a significant margin on both seen and unseen data across four well-known datasets.
arXiv Detail & Related papers (2020-04-03T16:07:58Z)
- Learning to Cluster Faces via Confidence and Connectivity Estimation [136.5291151775236]
We propose a fully learnable clustering framework without requiring a large number of overlapped subgraphs.
Our method significantly improves clustering accuracy and thus performance of the recognition models trained on top, yet it is an order of magnitude more efficient than existing supervised methods.
arXiv Detail & Related papers (2020-04-01T13:39:37Z)