Clustering small datasets in high-dimension by random projection
- URL: http://arxiv.org/abs/2008.09579v1
- Date: Fri, 21 Aug 2020 16:49:37 GMT
- Title: Clustering small datasets in high-dimension by random projection
- Authors: Alden Bradford, Tarun Yellamraju, and Mireille Boutin
- Abstract summary: We propose a low-computation method to find statistically significant clustering structures in a small dataset.
The method proceeds by projecting the data on a random line and seeking binary clusterings in the resulting one-dimensional data.
The statistical validity of the clustering structures obtained is tested in the projected one-dimensional space.
- Score: 2.2940141855172027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Datasets in high-dimension do not typically form clusters in their original
space; the issue is worse when the number of points in the dataset is small. We
propose a low-computation method to find statistically significant clustering
structures in a small dataset. The method proceeds by projecting the data on a
random line and seeking binary clusterings in the resulting one-dimensional
data. Non-linear separations are obtained by extending the feature space using
monomials of higher degrees in the original features. The statistical validity
of the clustering structures obtained is tested in the projected
one-dimensional space, thus bypassing the challenge of statistical validation
in high-dimension. Projecting on a random line is an extreme dimension
reduction technique that has previously been used successfully as part of a
hierarchical clustering method for high-dimensional data. Our experiments show
that with this simplified framework, statistically significant clustering
structures can be found with as few as 100-200 points, depending on the
dataset. The different structures uncovered are found to persist as more points
are added to the dataset.
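As a concrete illustration, here is a minimal Python sketch of the pipeline the abstract describes: lift the features with degree-2 monomials, project onto a random line, split the sorted 1D projection at its largest gap, and compare that gap against a Gaussian null to assess significance. The split statistic and the null model are illustrative assumptions, not necessarily the test used in the paper.

```python
import numpy as np

def monomial_lift(X, degree=2):
    """Append degree-2 monomials of the features (a simple non-linear lift)."""
    if degree < 2:
        return X
    cols = [X] + [X[:, i:i + 1] * X[:, i:] for i in range(X.shape[1])]
    return np.hstack(cols)

def project_and_split(X, n_null=500, rng=None):
    """Project onto a random line and test a binary split for significance.

    Returns (labels, p_value). The split statistic (largest gap in the
    sorted 1D projection) and the Gaussian null are illustrative choices.
    """
    rng = np.random.default_rng(rng)
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)        # standardize features
    u = rng.standard_normal(X.shape[1])
    u /= np.linalg.norm(u)                          # random unit direction
    z = X @ u                                       # 1D projected data
    z = (z - z.mean()) / (z.std() + 1e-12)

    def largest_gap(v):
        s = np.sort(v)
        g = np.diff(s)
        k = int(np.argmax(g))
        return g[k], (s[k] + s[k + 1]) / 2          # gap size, split threshold

    gap, thr = largest_gap(z)
    labels = (z > thr).astype(int)

    # Null: the same statistic on i.i.d. standard-normal samples of equal size.
    null = [largest_gap(rng.standard_normal(len(z)))[0] for _ in range(n_null)]
    p_value = (1 + sum(g >= gap for g in null)) / (1 + n_null)
    return labels, p_value

# Example: try one random line on a lifted 150-point, 20-feature dataset.
# labels, p = project_and_split(monomial_lift(np.random.rand(150, 20)))
```

In practice one would presumably repeat the projection over many random lines and keep only the splits whose p-values survive a multiple-testing correction.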
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Spatio-Temporal Surrogates for Interaction of a Jet with High Explosives: Part II -- Clustering Extremely High-Dimensional Grid-Based Data [0.0]
In this report, we consider output data from simulations of a jet interacting with high explosives.
We show how we can use the randomness of both the random projections and the choice of initial centroids in k-means clustering to determine the number of clusters in our dataset (a sketch of this idea follows the entry).
arXiv Detail & Related papers (2023-07-03T23:36:43Z)
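The cluster-counting idea in the entry above lends itself to a short sketch (an assumed instantiation, not the report's code): run k-means over many random projections and random initializations for each candidate k, and keep the k whose labelings agree most. The mean pairwise adjusted Rand index as the stability score, and the projection dimension, are assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.random_projection import GaussianRandomProjection

def stable_cluster_count(X, k_range=range(2, 8), n_trials=20, dim=10, seed=0):
    """Pick the k whose k-means labelings agree most across random
    projections and random initializations (agreement = mean pairwise ARI)."""
    rs = np.random.RandomState(seed)
    scores = {}
    for k in k_range:
        labelings = []
        for _ in range(n_trials):
            Z = GaussianRandomProjection(
                n_components=dim,
                random_state=rs.randint(2**31 - 1)).fit_transform(X)
            labelings.append(
                KMeans(n_clusters=k, n_init=1,
                       random_state=rs.randint(2**31 - 1)).fit_predict(Z))
        scores[k] = np.mean([adjusted_rand_score(a, b)
                             for a, b in combinations(labelings, 2)])
    return max(scores, key=scores.get), scores
```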
- Unsupervised Manifold Linearizing and Clustering [19.879641608165887]
We propose to optimize the Maximal Coding Rate Reduction metric with respect to both the data representation and a novel doubly stochastic cluster membership.
Experiments on CIFAR-10, -20, -100, and TinyImageNet-200 datasets show that the proposed method is much more accurate and scalable than state-of-the-art deep clustering methods.
arXiv Detail & Related papers (2023-01-04T20:08:23Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the space of sequences.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to separate anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods (a minimal sketch follows this entry).
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
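A minimal sketch of the framework in the entry above, assuming the patch embeddings have already been extracted from a pretrained network; the extraction step, the mean pooling, and the choice of agglomerative clustering are all assumptions standing in for "off-the-shelf" components.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_anomaly_types(patch_embeddings, n_types):
    """Group images into clusters of anomaly types from patch-level features.

    patch_embeddings : (n_images, n_patches, dim) array from a pretrained
    network (extraction not shown). Mean pooling stands in for whatever
    pooling an actual system would use.
    """
    feats = patch_embeddings.mean(axis=1)                  # pool patches
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize
    return AgglomerativeClustering(n_clusters=n_types, metric="cosine",
                                   linkage="average").fit_predict(feats)
```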
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build a neighborhood similarity graph from the data matrix and define the context of each data point.
Experiments on data classification and clustering with eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis testing.
arXiv Detail & Related papers (2021-03-10T23:10:47Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, combining reconstruction error with $l_{2,p}$-norm regularization (a plausible form of the objective is given after this entry).
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
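The entry above does not spell out the objective; a plausible form (an assumption, since the summary gives no formula) combines a PCA-style reconstruction error with a row-sparse $l_{2,p}$ penalty:

$$\min_{W \in \mathbb{R}^{d \times k}} \; \lVert X - X W W^{\top} \rVert_F^2 + \lambda \lVert W \rVert_{2,p}, \qquad \lVert W \rVert_{2,p} = \Big( \sum_{i=1}^{d} \lVert w^i \rVert_2^{\,p} \Big)^{1/p},$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix, $w^i$ is the $i$-th row of $W$, and $0 < p \le 1$. The penalty drives entire rows of $W$ to zero, so the features whose rows survive are the ones selected.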
- Self-Representation Based Unsupervised Exemplar Selection in a Union of Subspaces [27.22427926657327]
We present a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the $\ell_1$ norm of the representation coefficients.
When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace.
We also develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data.
arXiv Detail & Related papers (2020-06-07T19:43:33Z)
- Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes a drawback of existing methods, which seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
- Stochastic Sparse Subspace Clustering [20.30051592270384]
State-of-the-art subspace clustering methods are based on the self-expressive model, which represents each data point as a linear combination of other data points.
We introduce dropout, based on randomly dropping out data points in the self-expressive model, to address the issue of over-segmentation.
This leads to a scalable and flexible sparse subspace clustering approach, termed Stochastic Sparse Subspace Clustering (a sketch follows this entry).
arXiv Detail & Related papers (2020-05-04T13:09:17Z)
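A hedged sketch of the dropout idea from the entry above (an illustrative simplification, not the paper's implementation): regress each point on a random subsample of the other points with a lasso penalty, average the coefficients over repetitions into an affinity matrix, and apply spectral clustering.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import Lasso

def dropout_ssc(X, n_clusters, keep=0.5, n_reps=10, alpha=0.01, seed=0):
    """Self-expressive clustering with dropout: each point is regressed on a
    random subset of the other points; coefficients are averaged into an
    affinity matrix for spectral clustering."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    C = np.zeros((n, n))
    for _ in range(n_reps):
        for i in range(n):
            idx = np.flatnonzero(rng.random(n) < keep)
            idx = idx[idx != i]                    # never use the point itself
            coef = Lasso(alpha=alpha, max_iter=2000).fit(X[idx].T, X[i]).coef_
            C[i, idx] += np.abs(coef) / n_reps
    A = 0.5 * (C + C.T)                            # symmetric affinity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(A)
```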
- A new model for natural groupings in high-dimensional data [0.4604003661048266]
Clustering aims to divide a set of points into groups.
Recent experiments have uncovered several high-dimensional datasets that admit several different binary groupings.
This paper describes a probability model for the data that could explain this phenomenon.
arXiv Detail & Related papers (2019-09-14T02:38:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.