Intrinsic dimension estimation for discrete metrics
- URL: http://arxiv.org/abs/2207.09688v1
- Date: Wed, 20 Jul 2022 06:38:36 GMT
- Title: Intrinsic dimension estimation for discrete metrics
- Authors: Iuri Macocco, Aldo Glielmo, Jacopo Grilli and Alessandro Laio
- Abstract summary: In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
- Score: 65.5438227932088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world datasets characterized by discrete features are ubiquitous: from
categorical surveys to clinical questionnaires, from unweighted networks to DNA
sequences. Nevertheless, the most common unsupervised dimensionality reduction
methods are designed for continuous spaces, and their use on discrete spaces
can lead to errors and biases. In this letter we introduce an algorithm to
infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We
demonstrate its accuracy on benchmark datasets, and we apply it to analyze a
metagenomic dataset for species fingerprinting, finding a surprisingly small
ID, of order 2. This suggests that evolutionary pressure acts on a
low-dimensional manifold despite the high dimensionality of sequence space.
Related papers
- $\Gamma$-VAE: Curvature regularized variational autoencoders for
uncovering emergent low dimensional geometric structure in high dimensional
data [0.25128687379089687]
Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces.
We show that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models.
arXiv Detail & Related papers (2024-03-02T03:26:09Z) - Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z) - Random Smoothing Regularization in Kernel Gradient Descent Learning [24.383121157277007]
- Random Smoothing Regularization in Kernel Gradient Descent Learning [24.383121157277007]
We present a framework for random smoothing regularization that can adaptively learn a wide range of ground truth functions belonging to the classical Sobolev spaces.
Our estimator can adapt to the structural assumptions of the underlying data and avoid the curse of dimensionality.
arXiv Detail & Related papers (2023-05-05T13:37:34Z) - Topological Singularity Detection at Multiple Scales [11.396560798899413]
Real-world data exhibits distinct non-manifold structures that can lead to erroneous findings.
We develop a framework that quantifies the local intrinsic dimension and yields a Euclidicity score for assessing the 'manifoldness' of a point across multiple scales.
Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.
arXiv Detail & Related papers (2022-09-30T20:00:32Z) - Analyzing the Latent Space of GAN through Local Dimension Estimation [4.688163910878411]
The success of style-based GANs (StyleGANs) in high-fidelity image synthesis has motivated research to understand the semantic properties of their latent spaces.
We propose a local dimension estimation algorithm for arbitrary intermediate layers in a pre-trained GAN model.
Our proposed metric, called Distortion, measures the inconsistency of the intrinsic space over the learned latent space.
arXiv Detail & Related papers (2022-05-26T06:36:06Z) - Intrinsic Dimension Estimation [92.87600241234344]
We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees.
We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending on the intrinsic dimension of the data.
arXiv Detail & Related papers (2021-06-08T00:05:39Z) - A Local Similarity-Preserving Framework for Nonlinear Dimensionality
Reduction with Neural Networks [56.068488417457935]
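For contrast with the discrete estimator of the main paper, here is a compact sketch of a classical continuous-space ID estimator, the TwoNN of Facco et al.; it is shown as generic background, not as the specific estimator this entry proposes.

```python
# Hedged sketch of the TwoNN intrinsic-dimension estimator (Facco et al.):
# the ratio of each point's second to first nearest-neighbor distance is
# Pareto-distributed with exponent d, which gives a one-line MLE.
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    dists = cKDTree(X).query(X, k=3)[0]  # distances to self, 1st and 2nd NN
    mu = dists[:, 2] / dists[:, 1]       # second-to-first neighbor ratio
    return len(mu) / np.log(mu).sum()    # MLE of d from P(mu > x) = x^(-d)

rng = np.random.default_rng(0)
print(twonn_id(rng.normal(size=(5000, 3))))  # should be close to 3
```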
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments on data classification and clustering across eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis testing.
arXiv Detail & Related papers (2021-03-10T23:10:47Z) - Manifold Learning via Manifold Deflation [105.7418091051558]
Dimensionality reduction methods provide a valuable means to visualize and interpret high-dimensional data.
Many popular methods can fail dramatically, even on simple two-dimensional manifolds.
This paper presents an embedding method based on a novel, incremental tangent space estimator that incorporates global structure as coordinates.
Empirically, we show our algorithm recovers novel and interesting embeddings on real-world and synthetic datasets.
arXiv Detail & Related papers (2020-07-07T10:04:28Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.