MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions
- URL: http://arxiv.org/abs/2410.16124v1
- Date: Mon, 21 Oct 2024 15:51:30 GMT
- Title: MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions
- Authors: Polina Turishcheva, Laura Hansel, Martin Ritzert, Marissa A. Weis, Alexander S. Ecker
- Abstract summary: We propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets: individual samples are noisy and clusters do not perfectly separate.
MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST.
Preliminary benchmarks of common clustering algorithms on MNIST-Nd suggest that Leiden is the most robust as dimensionality grows.
- Score: 46.67219141114834
- Abstract: Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the impact of dimensionality on clustering. Preliminary common clustering algorithm benchmarks on MNIST-Nd suggest that Leiden is the most robust for growing dimensions.
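As a rough illustration of the benchmark the abstract describes, here is a minimal sketch (not the authors' code) of Leiden clustering on one MNIST-Nd dataset: build a symmetrized kNN graph over the latent vectors, partition it with Leiden, and score the partition against the digit labels. The names `X_d` and `y`, the neighborhood size, the resolution, and the RBConfiguration quality function are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' pipeline). X_d is an
# (n_samples, d) array of latents from a mixture VAE trained with
# d in {2, ..., 64}; y holds the ground-truth digit labels.
import numpy as np
import igraph as ig
import leidenalg as la
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics import adjusted_rand_score

def leiden_cluster(X, k=15, resolution=1.0, seed=0):
    """Run Leiden on a symmetrized k-nearest-neighbor graph of the rows of X."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    A = A.maximum(A.T)  # make the kNN graph undirected
    edges = [(int(i), int(j)) for i, j in zip(*A.nonzero()) if i < j]
    g = ig.Graph(n=X.shape[0], edges=edges)
    part = la.find_partition(
        g,
        la.RBConfigurationVertexPartition,  # modularity-like quality function
        resolution_parameter=resolution,
        seed=seed,
    )
    return np.asarray(part.membership)

# Hypothetical usage, one dataset per latent dimensionality:
# for d, X_d in datasets.items():
#     print(d, adjusted_rand_score(y, leiden_cluster(X_d)))
```

Scoring each dimensionality with the adjusted Rand index against the digit labels is one plausible robustness measure; the paper's exact metrics and hyperparameters may differ.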
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Transferable Deep Metric Learning for Clustering [1.2762298148425795]
Clustering in high-dimensional spaces is a difficult task; the usual distance metrics may no longer be appropriate under the curse of dimensionality.
We show that we can learn a metric on a labelled dataset, then apply it to cluster a different dataset.
We achieve results competitive with the state-of-the-art while using only a small number of labelled training datasets and shallow networks.
arXiv Detail & Related papers (2023-02-13T17:09:59Z)
- Adaptively-weighted Integral Space for Fast Multiview Clustering [54.177846260063966]
We propose an Adaptively-weighted Integral Space for Fast Multiview Clustering (AIMC) with nearly linear complexity.
Specifically, view generation models are designed to reconstruct the view observations from the latent integral space.
Experiments conducted on several real-world datasets confirm the superiority of the proposed AIMC method.
arXiv Detail & Related papers (2022-08-25T05:47:39Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- DRBM-ClustNet: A Deep Restricted Boltzmann-Kohonen Architecture for Data Clustering [0.0]
A Bayesian Deep Restricted Boltzmann-Kohonen architecture for data clustering, termed DRBM-ClustNet, is proposed.
Unlabeled data is processed in three stages for efficient clustering of non-linearly separable datasets.
The framework is evaluated based on clustering accuracy and ranked against other state-of-the-art clustering methods.
arXiv Detail & Related papers (2022-05-13T15:12:18Z)
- SQuadMDS: a lean Stochastic Quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP [3.7731754155538164]
This work introduces a force-directed approach to multidimensional scaling with O(N) time and space complexity for N data points.
The method can be combined with force-directed layouts from the neighbor-embedding family, such as t-SNE, to produce embeddings that preserve both the global and the local structure of the data.
arXiv Detail & Related papers (2022-02-24T13:14:58Z)
- Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings [1.7188280334580195]
This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved.
The proposed algorithm has the same complexity as the original $t$-SNE when embedding new items, and a lower one when embedding a dataset sliced into sub-pieces.
arXiv Detail & Related papers (2021-09-22T06:45:37Z)
- Manifold Topology Divergence: a Framework for Comparing Data Manifolds [109.0784952256104]
We develop a framework for comparing data manifolds, aimed at the evaluation of deep generative models.
Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence).
We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance.
arXiv Detail & Related papers (2021-06-08T00:30:43Z)
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general-purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of the data matrix and define the context of each data point.
Experiments on data classification and clustering on eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis testing.
arXiv Detail & Related papers (2021-03-10T23:10:47Z)
- Clustering small datasets in high-dimension by random projection [2.2940141855172027]
We propose a low-computation method to find statistically significant clustering structures in a small dataset (a minimal sketch of the idea follows the list).
The method proceeds by projecting the data onto a random line and seeking binary clusterings in the resulting one-dimensional data.
The statistical validity of the clustering structures obtained is tested in the projected one-dimensional space.
arXiv Detail & Related papers (2020-08-21T16:49:37Z)
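The random-projection method summarized above is simple enough to sketch. Below is a hedged illustration, assuming the binary split is chosen by the widest gap in the one-dimensional projection; the helper name and the gap criterion are assumptions, and the paper's statistical validity test is not reproduced here.

```python
# Hedged sketch of clustering by 1-D random projection (hypothetical helper,
# not the paper's code). A faithful version would additionally test the best
# split's statistical significance against a null model.
import numpy as np

def best_random_projection_split(X, n_projections=100, seed=0):
    """Project X onto random lines; return the binary split with the widest
    normalized gap found in any one-dimensional projection."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_score, best_w, best_threshold = -np.inf, None, None
    for _ in range(n_projections):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)              # random unit direction
        z = np.sort(X @ w)                  # sorted 1-D projection
        gaps = np.diff(z)
        i = int(np.argmax(gaps[1:-1])) + 1  # skip splits isolating one point
        score = gaps[i] / (z[-1] - z[0])    # gap relative to projected range
        if score > best_score:
            best_score = score
            best_w = w
            best_threshold = (z[i] + z[i + 1]) / 2
    labels = (X @ best_w > best_threshold).astype(int)
    return labels, best_score
```

Keeping a split only when its gap score exceeds what random projections of unimodal data would produce is one way to approximate the significance test the summary mentions.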
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.