Local intrinsic dimensionality estimators based on concentration of
measure
- URL: http://arxiv.org/abs/2001.11739v3
- Date: Sun, 19 Apr 2020 10:54:08 GMT
- Title: Local intrinsic dimensionality estimators based on concentration of
measure
- Authors: Jonathan Bac, Andrei Zinovyev
- Abstract summary: Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds.
We introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Intrinsic dimensionality (ID) is one of the most fundamental characteristics
of multi-dimensional data point clouds. Knowing ID is crucial to choose the
appropriate machine learning approach as well as to understand its behavior and
validate it. ID can be computed globally for the whole data point distribution,
or computed locally in different regions of the data space. In this paper, we
introduce new local estimators of ID based on linear separability of
multi-dimensional data point clouds, which is one of the manifestations of
concentration of measure. We empirically study the properties of these
estimators and compare them with other recently introduced ID estimators
exploiting various effects of measure concentration. Observed differences
between estimators can be used to anticipate their behaviour in practical
applications.
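To make the separability idea concrete, here is a minimal, self-contained illustration of a separability-based ID estimate: points are whitened, projected to the unit sphere, the empirical fraction of Fisher-inseparable point pairs is measured, and that fraction is inverted against an assumed asymptotic expression for the inseparability probability of a uniform distribution on the n-sphere, p(alpha, n) ~ (1 - alpha^2)^((n-1)/2) / (alpha * sqrt(2*pi*n)). This is a rough sketch under those assumptions, not the authors' exact estimator; the threshold alpha, the PCA preprocessing, and the clamping of degenerate cases are simplifications introduced here.

```python
# Minimal sketch (assumptions noted in the text above): per-point ID estimates
# from Fisher separability of points, one manifestation of concentration of measure.
import numpy as np
from scipy.optimize import brentq


def separability_id(X, alpha=0.8, max_dim=None):
    """Per-point intrinsic-dimension estimates for the rows of X."""
    n_pts, n_feat = X.shape
    max_dim = max_dim or n_feat

    # 1. Centre and whiten with an SVD, dropping near-zero variance directions.
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s[0]
    Z = (Xc @ Vt[keep].T) / s[keep]

    # 2. Project every point onto the unit sphere.
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)

    # 3. Empirical inseparability fraction per point: share of other points y
    #    with <x, y> > alpha, i.e. x is not Fisher-separable from y.
    G = Z @ Z.T
    np.fill_diagonal(G, -np.inf)                      # ignore self-comparisons
    p_insep = (G > alpha).sum(axis=1) / (n_pts - 1)

    # 4. Invert the assumed asymptotic inseparability probability for a
    #    uniform distribution on the n-sphere:
    #    p(alpha, n) ~ (1 - alpha^2)^((n-1)/2) / (alpha * sqrt(2*pi*n)).
    def log_p_theory(n):
        return 0.5 * (n - 1) * np.log(1 - alpha ** 2) \
            - np.log(alpha * np.sqrt(2 * np.pi * n))

    def invert(p):
        if p <= 0:                                    # fully separable point
            return float(max_dim)
        f = lambda n: log_p_theory(n) - np.log(p)
        lo, hi = 1.0, float(max_dim)
        if f(lo) <= 0:                                # more overlap than n = 1 predicts
            return lo
        if f(hi) >= 0:                                # less overlap than max_dim predicts
            return hi
        return brentq(f, lo, hi)                      # f is monotone in n

    return np.array([invert(p) for p in p_insep])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5-dimensional Gaussian embedded linearly in 20 ambient dimensions.
    X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 20))
    print(np.median(separability_id(X)))              # roughly 5
```

On data sampled from a low-dimensional linear subspace the per-point estimates concentrate around the true dimension; in practice the choice of alpha, and of the neighbourhood over which separability is measured, controls the local-versus-global character of the estimate.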
Related papers
- Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z)
- Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification [43.26660964074272]
The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection.
We introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful.
We derive theoretical guarantees and illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets.
arXiv Detail & Related papers (2024-05-24T01:08:05Z)
- DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z)
- Intrinsic Dimension for Large-Scale Geometric Learning [0.0]
A naive approach to determine the dimension of a dataset is based on the number of attributes.
More sophisticated methods derive a notion of intrinsic dimension (ID) that employs more complex feature functions.
arXiv Detail & Related papers (2022-10-11T09:50:50Z)
- Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis [0.0]
We propose a local ID estimation strategy stable even for 'tight' localities consisting of as few as 20 sample points.
Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
arXiv Detail & Related papers (2022-09-29T00:00:11Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Local Intrinsic Dimensionality Signals Adversarial Perturbations [28.328973408891834]
Local intrinsic dimensionality (LID) is a local metric that describes the minimum number of latent variables required to describe each data point.
In this paper, we derive a lower bound and an upper bound for the LID value of a perturbed data point and demonstrate that the bounds, in particular the lower bound, have a positive correlation with the magnitude of the perturbation. (A standard nearest-neighbour LID estimator is sketched after this list.)
arXiv Detail & Related papers (2021-09-24T08:29:50Z)
- Featurized Density Ratio Estimation [82.40706152910292]
In our work, we propose to leverage an invertible generative model to map the two distributions into a common feature space prior to estimation.
This featurization brings the densities closer together in latent space, sidestepping pathological scenarios where the learned density ratios in input space can be arbitrarily inaccurate.
At the same time, the invertibility of our feature map guarantees that the ratios computed in feature space are equivalent to those in input space (the change-of-variables identity behind this equivalence is written out after this list).
arXiv Detail & Related papers (2021-07-05T18:30:26Z)
- Nonparametric Density Estimation from Markov Chains [68.8204255655161]
We introduce a new nonparametric density estimator inspired by Markov chains, generalizing the well-known Kernel Density Estimator.
Our estimator presents several benefits with respect to the usual ones and can be used straightforwardly as a foundation in all density-based algorithms.
arXiv Detail & Related papers (2020-09-08T18:33:42Z)
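For the "Local Intrinsic Dimensionality Signals Adversarial Perturbations" entry above, the quantity being bounded is the LID of individual points. The sketch below is the standard maximum-likelihood LID estimate from k-nearest-neighbour distances (in the spirit of the Levina-Bickel / Amsaleg et al. estimators), included only to make the quantity concrete; it is not the bound derivation from that paper, and the neighbourhood size k is an arbitrary choice here.

```python
# Standard k-NN maximum-likelihood LID estimate (an illustration, not that
# paper's bounds): LID(x) = -1 / mean_i log(r_i(x) / r_k(x)), where
# r_1 <= ... <= r_k are distances from x to its k nearest neighbours.
import numpy as np


def lid_mle(X, k=20):
    """Per-point LID estimates for the rows of X against all other rows."""
    # Pairwise Euclidean distances, then the k nearest neighbours of each row
    # (the zero self-distance in the first column is dropped).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = np.sort(d, axis=1)[:, 1:k + 1]
    return -1.0 / np.mean(np.log(d / d[:, -1:]), axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Points on a 3-dimensional subspace of a 10-dimensional space.
    X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 10))
    print(np.median(lid_mle(X)))   # roughly 3
```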
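For the "Featurized Density Ratio Estimation" entry, the stated equivalence of feature-space and input-space ratios follows from the change-of-variables formula: the Jacobian factor of an invertible map appears in both densities and cancels in the ratio. Written out (a generic identity, not quoted from that paper):

```latex
% Invertible feature map z = f(x); change of variables for both densities:
%   p_X(x) = p_Z(f(x)) |det df/dx|,   q_X(x) = q_Z(f(x)) |det df/dx|
% The common Jacobian factor cancels in the ratio:
\frac{p_X(x)}{q_X(x)}
  = \frac{p_Z\bigl(f(x)\bigr)\,\bigl|\det \partial f/\partial x\bigr|}
         {q_Z\bigl(f(x)\bigr)\,\bigl|\det \partial f/\partial x\bigr|}
  = \frac{p_Z\bigl(f(x)\bigr)}{q_Z\bigl(f(x)\bigr)}.
```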
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.