A new model for natural groupings in high-dimensional data
- URL: http://arxiv.org/abs/1909.06511v2
- Date: Mon, 24 Jun 2024 13:02:32 GMT
- Title: A new model for natural groupings in high-dimensional data
- Authors: Mireille Boutin, Evzenie Coupkova
- Abstract summary: Clustering aims to divide a set of points into groups.
Recent experiments have uncovered several high-dimensional datasets that form different binary groupings after projection onto randomly chosen one-dimensional subspaces.
This paper describes a probability model for the data that could explain this phenomenon.
- Score: 0.4604003661048266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering aims to divide a set of points into groups. The current paradigm assumes that the grouping is well-defined (unique) given the probability model from which the data is drawn. Yet, recent experiments have uncovered several high-dimensional datasets that form different binary groupings after projecting the data to randomly chosen one-dimensional subspaces. This paper describes a probability model for the data that could explain this phenomenon. It is a simple model that serves as a proof of concept for understanding the geometry of high-dimensional data. We start by building a rescaled multivariate Bernoulli model (stretched hypercube) so as to create several overlapping grouping structures in the data. The size of each scaling parameter is related to the likelihood of uncovering the corresponding grouping by random 1D projection. Clusters in the original space are then created by adding noise to this cluster-free model. In high dimensions, these clusters would hardly be observable given a sample set from the distribution because of the curse of dimensionality, but the binary groupings are clear. Our construction makes it clear that one needs to distinguish between "groupings" and "clusters" in the original space. It also highlights the need to interpret any clustering found in projected data as merely one among potentially many other groupings in the dataset.
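To make the construction concrete, here is a minimal numpy sketch of the stretched-hypercube model plus a random 1D projection; the dimension, stretch factors, and noise level are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not values from the paper).
d, n = 50, 500                                      # ambient dimension, sample size
scales = np.sort(rng.uniform(0.1, 10.0, d))[::-1]   # per-coordinate stretch factors
noise_sigma = 0.5                                   # noise level for the added clusters

# Rescaled multivariate Bernoulli ("stretched hypercube"): each point is a
# hypercube vertex with coordinate k scaled by scales[k], plus isotropic noise.
vertices = rng.integers(0, 2, size=(n, d)).astype(float)
X = vertices * scales + noise_sigma * rng.standard_normal((n, d))

# Project onto a random 1D subspace. A coordinate with a large stretch factor
# tends to dominate the projection, so the binary grouping it induces is
# likely to be visible in the 1D data.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
z = X @ w

# Rough check: how well does a median split of the projection agree with the
# grouping defined by the most-stretched coordinate's vertex bit?
bits = vertices[:, 0].astype(bool)
split = z > np.median(z)
agreement = max(np.mean(split == bits), np.mean(split != bits))
print(f"agreement with the grouping on coordinate 0: {agreement:.2f}")
```

Rerunning with different random directions `w` illustrates the paper's point: different projections can surface different, equally legitimate binary groupings of the same dataset.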
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
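As an assumption-level illustration of that idea (not the paper's algorithm), a Gromov-Wasserstein coupling between a dataset and a small k-point target space can be read directly as a soft clustering; the sketch below uses POT's `ot.gromov.gromov_wasserstein`, and the discrete target geometry is our choice.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
# Toy data: three separated Gaussian blobs in 20 dimensions.
X = np.vstack([rng.standard_normal((40, 20)) + c for c in (-4.0, 0.0, 4.0)])
n, k = X.shape[0], 3

C1 = ot.dist(X, X)        # source pairwise (squared Euclidean) distances
C2 = 1.0 - np.eye(k)      # target: k prototypes with discrete metric (our choice)
p = np.full(n, 1.0 / n)   # uniform weights on data points
q = np.full(k, 1.0 / k)   # uniform weights on prototypes

# GW coupling between the two metric-measure spaces; row i of T says how
# point i distributes its mass over the k prototypes.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun="square_loss")
labels = T.argmax(axis=1)  # hard clustering read off the coupling
```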
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Spatio-Temporal Surrogates for Interaction of a Jet with High Explosives: Part II -- Clustering Extremely High-Dimensional Grid-Based Data [0.0]
In this report, we consider output data from simulations of a jet interacting with high explosives.
We show how we can use the randomness of both the random projections, and the choice of initial centroids in k-means clustering, to determine the number of clusters in our data set.
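A generic sketch of that idea (the silhouette criterion and all parameters here are our assumptions, not necessarily the report's procedure): score each candidate number of clusters across many random projections and random k-means initializations, and prefer a count whose score stays consistently high.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def stable_cluster_counts(X, k_range=range(2, 8), n_trials=20, proj_dim=10, seed=0):
    """Score each candidate k over repeated (random projection, random k-means
    init) trials; a k with a consistently high score is a plausible cluster
    count. A heuristic sketch, not the report's exact method."""
    rng = np.random.default_rng(seed)
    scores = {k: [] for k in k_range}
    for _ in range(n_trials):
        # Fresh Gaussian random projection each trial.
        P = rng.standard_normal((X.shape[1], proj_dim)) / np.sqrt(proj_dim)
        Z = X @ P
        for k in k_range:
            labels = KMeans(n_clusters=k, n_init=1,
                            random_state=int(rng.integers(2**31))).fit_predict(Z)
            scores[k].append(silhouette_score(Z, labels))
    # Mean and spread per k; stability across trials is the signal.
    return {k: (float(np.mean(v)), float(np.std(v))) for k, v in scores.items()}
```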
arXiv Detail & Related papers (2023-07-03T23:36:43Z)
- A Class of Dependent Random Distributions Based on Atom Skipping [2.3258287344692676]
We propose the Plaid Atoms Model (PAM), a novel Bayesian nonparametric model for grouped data.
PAM produces a dependent clustering pattern with overlapping and non-overlapping clusters across groups.
arXiv Detail & Related papers (2023-04-28T16:18:43Z)
- Randomly Projected Convex Clustering Model: Motivation, Realization, and Cluster Recovery Guarantees [18.521314122101774]
We propose a randomly projected convex clustering model for clustering a collection of $n$ high-dimensional data points in $\mathbb{R}^d$ with $K$ hidden clusters.
We prove that, under some mild conditions, the perfect recovery of the cluster membership assignments of the convex clustering model can be preserved.
The numerical results presented in this paper also demonstrate that the randomly projected convex clustering model can outperform the randomly projected K-means model in practice.
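For concreteness, a minimal cvxpy sketch of convex clustering run on randomly projected data; the uniform fusion weights, the projection dimension, and the regularization weight below are illustrative assumptions, and the paper's exact model and solver may differ.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, d, m, lam = 30, 100, 5, 1.0        # points, ambient dim, projected dim, fusion weight

# Toy data: two separated Gaussian blobs in R^d.
X = np.vstack([rng.standard_normal((n // 2, d)) + 4.0,
               rng.standard_normal((n // 2, d)) - 4.0])

# Random Gaussian projection to R^m, then convex clustering on the projections.
G = rng.standard_normal((d, m)) / np.sqrt(m)
Z = X @ G

U = cp.Variable((n, m))               # one "centroid" variable per point
fidelity = 0.5 * cp.sum_squares(Z - U)
fusion = sum(cp.norm(U[i] - U[j], 2)
             for i in range(n) for j in range(i + 1, n))
cp.Problem(cp.Minimize(fidelity + lam * fusion)).solve()

# Points whose centroid rows have (numerically) fused share a cluster; in
# practice lam is swept to trace out the whole clusterpath.
```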
arXiv Detail & Related papers (2023-03-29T16:47:25Z)
- Composite Feature Selection using Deep Ensembles [130.72015919510605]
We investigate the problem of discovering groups of predictive features without predefined grouping.
We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups.
We propose a new metric to measure similarity between discovered groups and the ground truth.
arXiv Detail & Related papers (2022-11-01T17:49:40Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- Local versions of sum-of-norms clustering [77.34726150561087]
We show that our method can separate arbitrarily close balls in the ball model.
We prove a quantitative bound on the error incurred in the clustering of disjoint connected sets.
arXiv Detail & Related papers (2021-09-20T14:45:29Z)
- Sum-of-norms clustering does not separate nearby balls [49.1574468325115]
We consider a continuous version of sum-of-norms clustering, in which the dataset is replaced by a general measure.
We state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete data points.
arXiv Detail & Related papers (2021-04-28T13:35:17Z)
- Mixed data Deep Gaussian Mixture Model: A clustering model for mixed datasets [0.0]
We introduce a model-based clustering method called the Mixed Deep Gaussian Mixture Model (MDGMM).
This architecture is flexible and can be adapted to mixed as well as to continuous or non-continuous data.
Our model provides continuous low-dimensional representations of the data which can be a useful tool to visualize mixed datasets.
arXiv Detail & Related papers (2020-10-13T19:52:46Z)
- Clustering small datasets in high-dimension by random projection [2.2940141855172027]
We propose a low-computation method to find statistically significant clustering structures in a small dataset.
The method proceeds by projecting the data on a random line and seeking binary clusterings in the resulting one-dimensional data.
The statistical validity of the clustering structures obtained is tested in the projected one-dimensional space.
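A sketch of that recipe (the largest-gap statistic and the threshold below are heuristic stand-ins of ours; the paper's actual significance test is not reproduced here):

```python
import numpy as np

def largest_gap_split(z):
    """Return the normalized largest gap in sorted 1D data and its position."""
    order = np.argsort(z)
    zs = z[order]
    gaps = np.diff(zs)
    i = int(np.argmax(gaps))
    return gaps[i] / (zs[-1] - zs[0] + 1e-12), order, i + 1

def random_line_groupings(X, n_lines=200, seed=0):
    """Project onto random lines and keep binary splits whose largest gap is
    unusually big. Heuristic sketch, not the paper's statistical test."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    found = []
    for _ in range(n_lines):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        gap, order, cut = largest_gap_split(X @ w)
        # For n uniform points on an interval, the expected largest spacing is
        # about log(n)/n; flag gaps well above that (a crude null, our choice).
        if gap > 3.0 * np.log(n) / n:
            labels = np.zeros(n, dtype=int)
            labels[order[cut:]] = 1
            found.append((gap, labels))
    return found
```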
arXiv Detail & Related papers (2020-08-21T16:49:37Z)
- Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns.
We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.