Statistical exploration of the Manifold Hypothesis
- URL: http://arxiv.org/abs/2208.11665v4
- Date: Fri, 9 Feb 2024 16:10:01 GMT
- Title: Statistical exploration of the Manifold Hypothesis
- Authors: Nick Whiteley, Annie Gray, Patrick Rubin-Delanchy
- Abstract summary: The Manifold Hypothesis asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold, embedded in high-dimensional space.
We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model.
We derive procedures to discover and interpret the geometry of high-dimensional data, and explore hypotheses about the data generating mechanism.
- Score: 10.389701595098922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Manifold Hypothesis is a widely accepted tenet of Machine Learning which
asserts that nominally high-dimensional data are in fact concentrated near a
low-dimensional manifold, embedded in high-dimensional space. This phenomenon
is observed empirically in many real-world situations, has led to the development
of a wide range of statistical methods in the last few decades, and has been
suggested as a key factor in the success of modern AI technologies. We show
that rich and sometimes intricate manifold structure in data can emerge from a
generic and remarkably simple statistical model -- the Latent Metric Model --
via elementary concepts such as latent variables, correlation and stationarity.
This establishes a general statistical explanation for why the Manifold
Hypothesis seems to hold in so many situations. Informed by the Latent Metric
Model we derive procedures to discover and interpret the geometry of
high-dimensional data, and explore hypotheses about the data generating
mechanism. These procedures operate under minimal assumptions and make use of
well-known, scalable graph-analytic algorithms.
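The abstract leaves the procedures unspecified, but a minimal sketch of the kind of scalable graph-analytic pipeline it alludes to is given below: a k-nearest-neighbour graph, shortest-path (geodesic) distances, and a classical multidimensional scaling step, in the style of Isomap. The function name, the neighbourhood size k and the embedding dimension d are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def graph_geometry_embedding(X, k=10, d=2):
    """Isomap-style sketch: recover low-dimensional geometry from a k-NN graph.

    An illustration of the kind of graph-analytic algorithm the abstract
    refers to, not the authors' exact procedure. Assumes the k-NN graph
    is connected (otherwise shortest paths are infinite).
    """
    # 1. Build a k-nearest-neighbour graph with Euclidean edge weights.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # 2. Approximate manifold (geodesic) distances by shortest paths.
    D = shortest_path(G, method="D", directed=False)
    # 3. Classical MDS on the geodesic distances: double-centre the
    #    squared distance matrix and take the top eigenvectors.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]            # largest eigenvalues first
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))

# Example: a noisy circle embedded in 50 dimensions.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 500)
X = np.c_[np.cos(t), np.sin(t)] @ rng.normal(size=(2, 50)) \
    + 0.01 * rng.normal(size=(500, 50))
Y = graph_geometry_embedding(X, k=10, d=2)   # approximately circular 2-D embedding
```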
Related papers
- A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models [6.647819824559201]
We study the large-sample properties of a likelihood-based approach for estimating conditional deep generative models.
Our results establish the convergence rate of a sieve maximum likelihood estimator of the conditional distribution.
arXiv Detail & Related papers (2024-10-02T20:46:21Z) - Emerging-properties Mapping Using Spatial Embedding Statistics: EMUSES [0.0]
EMUSES is an innovative approach to create high-dimensional embeddings that reveal latent structures within data.
By bridging the gap between predictive accuracy and interpretability, EMUSES offers researchers a powerful tool to understand the multifactorial origins of complex phenomena.
arXiv Detail & Related papers (2024-06-20T13:39:14Z) - Learning Discrete Concepts in Latent Hierarchical Models [73.01229236386148]
Learning concepts from natural high-dimensional data holds potential in building human-aligned and interpretable machine learning models.
We formalize concepts as discrete latent causal variables that are related via a hierarchical causal model.
We substantiate our theoretical claims with synthetic data experiments.
arXiv Detail & Related papers (2024-06-01T18:01:03Z) - Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
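Setting aside the paper's diffusion machinery, the "tall data" idea itself admits a short worked example: with conditionally independent observations, the joint posterior is the product of the single-observation posteriors divided by (n - 1) copies of the prior, which is closed-form in the conjugate Gaussian case. The sketch below is a hedged illustration of that factorisation only; the function and all numbers are made up, not taken from the paper.

```python
import numpy as np

# Tall-data pooling sketch (Gaussian mean, known noise variance):
#   p(theta | x_1..x_n) ∝ p(theta) * prod_i p(x_i | theta)
#                       ∝ prod_i p(theta | x_i) / p(theta)^(n-1)
# For Gaussians, precisions add and precision-weighted means add.

def pool_gaussian_posteriors(mus, taus, mu0, tau0):
    """Combine single-observation posteriors N(mus[i], 1/taus[i]),
    obtained under prior N(mu0, 1/tau0), into the tall-data posterior."""
    n = len(mus)
    tau = np.sum(taus) - (n - 1) * tau0                   # subtract prior n-1 times
    mu = (np.dot(taus, mus) - (n - 1) * tau0 * mu0) / tau
    return mu, tau

# Illustrative numbers (assumed):
mu0, tau0 = 0.0, 1.0                  # prior N(0, 1)
sigma2 = 0.5                          # known observation noise variance
rng = np.random.default_rng(1)
xs = rng.normal(2.0, np.sqrt(sigma2), size=5)
# Single-observation posteriors for a Gaussian mean:
taus = np.full(5, tau0 + 1.0 / sigma2)
mus = (tau0 * mu0 + xs / sigma2) / taus
mu_n, tau_n = pool_gaussian_posteriors(mus, taus, mu0, tau0)
# mu_n, tau_n match the direct conjugate update on all 5 observations.
```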
arXiv Detail & Related papers (2024-04-11T09:23:36Z) - Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
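The summary does not detail the spectral estimate, but a standard graph-theoretic device for guessing a class count is the eigengap heuristic on a normalised graph Laplacian, sketched below under assumed inputs (a similarity matrix W); it is a generic stand-in, not necessarily the paper's estimator.

```python
import numpy as np

def estimate_num_classes(W, k_max=20):
    """Eigengap heuristic: estimate the number of clusters as the position
    of the largest gap among the smallest eigenvalues of the normalised
    graph Laplacian L = I - D^{-1/2} W D^{-1/2}.

    A generic spectral device, not necessarily the paper's estimator."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(L))[:k_max]
    gaps = np.diff(eigvals)             # gaps between consecutive eigenvalues
    return int(np.argmax(gaps)) + 1     # index of largest gap = cluster count

# Example: three well-separated blobs give an estimate of 3.
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(c, 0.1, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 0.5)                   # Gaussian similarity matrix
print(estimate_num_classes(W))          # -> 3
```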
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied as a non-linear self-attention layer with trainable, low-rank query and key matrices.
We show that the model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and that it undergoes a transition from the former to the latter as sample complexity increases.
arXiv Detail & Related papers (2024-02-06T11:13:54Z) - A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections [0.18416014644193066]
We extend one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and point-to-point distances.
Our method, rooted in $\alpha$-unimodality assumptions, yields a novel unimodality test named mud-pod.
Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets.
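One plausible reading of this construction, sketched below: whiten the data so that Euclidean norms become Mahalanobis distances, dip-test many one-dimensional random projections, and combine the p-values. The sketch assumes the third-party diptest package and is an illustration in the spirit of mud-pod, not the exact test.

```python
import numpy as np
import diptest  # third-party: pip install diptest

def unimodality_test(X, n_proj=50, alpha=0.05, seed=0):
    """Random-projection unimodality sketch in the spirit of mud-pod:
    whiten the data (so Euclidean distances become Mahalanobis distances),
    dip-test many 1-D random projections, and Bonferroni-combine p-values.

    An illustrative reading of the abstract, not the exact mud-pod test."""
    rng = np.random.default_rng(seed)
    # Whitening: z = Cov^{-1/2}(x - mean), so distances in z-space are
    # Mahalanobis distances in the original space.
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Z = Xc @ vecs / np.sqrt(np.maximum(vals, 1e-12))
    pvals = []
    for _ in range(n_proj):
        u = rng.normal(size=Z.shape[1])
        u /= np.linalg.norm(u)
        _, p = diptest.diptest(Z @ u)   # Hartigan's dip test on the projection
        pvals.append(p)
    # Bonferroni: reject unimodality if any projection looks multimodal.
    return min(pvals) * n_proj < alpha  # True => evidence of multimodality

# Two well-separated Gaussian clusters should be flagged as multimodal.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
print(unimodality_test(X))  # -> True
```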
arXiv Detail & Related papers (2023-11-28T09:11:02Z) - Conformal inference for regression on Riemannian Manifolds [49.7719149179179]
We investigate prediction sets for regression scenarios when the response variable, denoted by $Y$, resides on a manifold, and the covariate, denoted by $X$, lies in Euclidean space.
We prove the almost sure convergence of the empirical version of these regions on the manifold to their population counterparts.
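As a hedged illustration of conformal prediction with a manifold-valued response, the sketch below runs split conformal on the unit sphere S^2 (standing in for a general manifold) with geodesic residuals as nonconformity scores; the constant predictor, the data and the coverage level are assumptions, not the paper's setup.

```python
import numpy as np

def geodesic(p, q):
    """Great-circle distance between unit vectors on the sphere."""
    return np.arccos(np.clip(np.sum(p * q, axis=-1), -1.0, 1.0))

def split_conformal_radius(Y_cal, Y_hat_cal, alpha=0.1):
    """Split conformal with geodesic residuals: the prediction set for a
    new x is the geodesic ball of this radius around mu_hat(x), giving
    >= 1 - alpha marginal coverage under exchangeability."""
    scores = np.sort(geodesic(Y_cal, Y_hat_cal))       # nonconformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))            # conformal quantile rank
    return scores[min(k, n) - 1]

# Toy setup (assumed, not the paper's): responses near the north pole,
# and a constant predictor mu_hat(x) = north pole.
rng = np.random.default_rng(4)
Y_cal = rng.normal(size=(500, 3)) * 0.1 + np.array([0.0, 0.0, 1.0])
Y_cal /= np.linalg.norm(Y_cal, axis=1, keepdims=True)  # project onto sphere
mu_hat = np.tile([0.0, 0.0, 1.0], (500, 1))
r = split_conformal_radius(Y_cal, mu_hat, alpha=0.1)
# Prediction set: {y on S^2 : geodesic(y, mu_hat(x)) <= r}, ~90% coverage.
```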
arXiv Detail & Related papers (2023-10-12T10:56:25Z) - Towards a mathematical understanding of learning from few examples with
nonlinear feature maps [68.8204255655161]
We consider the problem of data classification where the training set consists of just a few data points.
We reveal key relationships between the geometry of an AI model's feature space, the structure of the underlying data distributions, and the model's generalisation capabilities.
arXiv Detail & Related papers (2022-11-07T14:52:58Z) - Learning from few examples with nonlinear feature maps [68.8204255655161]
We explore the phenomenon and reveal key relationships between the dimensionality of an AI model's feature space, the non-degeneracy of data distributions, and the model's generalisation capabilities.
Our analysis focuses on how nonlinear feature transformations, which map the original data into higher- and possibly infinite-dimensional spaces, influence the resulting model's generalisation capabilities.
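A small worked example of such a lift, not the papers' specific construction: random Fourier features approximate the infinite-dimensional RBF kernel feature map, and a handful of examples that are not linearly separable in the original space become separable after the transformation.

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
    """Lift data into a higher-dimensional space with random Fourier
    features, approximating the (infinite-dimensional) RBF kernel feature
    map. Illustrates the kind of nonlinear transformation the papers
    analyse; not their specific construction."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Few-shot toy problem: concentric circles, 5 examples per class, not
# linearly separable in 2-D but separable after the lift.
rng = np.random.default_rng(5)
angles = rng.uniform(0, 2 * np.pi, 10)
radii = np.array([0.5] * 5 + [2.0] * 5)          # inner vs outer class
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.array([0] * 5 + [1] * 5)

Phi = random_fourier_features(X, n_features=500, gamma=0.5)
# Minimum-norm least-squares linear classifier in the lifted space.
w, *_ = np.linalg.lstsq(Phi, 2.0 * y - 1.0, rcond=None)
pred = (Phi @ w > 0).astype(int)
print((pred == y).mean())  # -> 1.0 on the training examples
```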
arXiv Detail & Related papers (2022-03-31T10:36:50Z) - Information-theoretic limits of a multiview low-rank symmetric spiked
matrix model [19.738567726658875]
We consider a generalization of an important class of high-dimensional inference problems, namely spiked symmetric matrix models.
We rigorously establish the information-theoretic limits through the proof of single-letter formulas.
We improve the recently introduced adaptive method, so that it can be used to study low-rank models.
arXiv Detail & Related papers (2020-05-16T15:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.