Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification
- URL: http://arxiv.org/abs/2405.15132v1
- Date: Fri, 24 May 2024 01:08:05 GMT
- Title: Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification
- Authors: Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, Antonietta Mira,
- Abstract summary: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection.
In almost any real-world dataset the ID depends on the scale at which the data are analysed.
We introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful.
- Score: 43.26660964074272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large scale, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that for distances smaller than the correct scale the density of the data is constant. Since to estimate the density it is necessary to know the ID, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets.
Related papers
- Simple and Effective Augmentation Methods for CSI Based Indoor
Localization [37.3026733673066]
We propose two algorithms for channel state information based indoor localization motivated by physical considerations.
As little as 10% of the original dataset size is enough to get the same performance as the original dataset.
If we further augment the dataset with the proposed techniques, test accuracy is improved more than three-fold.
arXiv Detail & Related papers (2022-11-19T20:27:46Z) - Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutive pressure acts on a low-dimensional manifold despite the high-dimensionality of sequences' space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z) - No Shifted Augmentations (NSA): compact distributions for robust
self-supervised Anomaly Detection [4.243926243206826]
Unsupervised Anomaly detection (AD) requires building a notion of normalcy, distinguishing in-distribution (ID) and out-of-distribution (OOD) data.
We investigate how the emph geometrical compactness of the ID feature distribution makes isolating and detecting outliers easier.
We propose novel architectural modifications to the self-supervised feature learning step, that enable such compact distributions for ID data to be learned.
arXiv Detail & Related papers (2022-03-19T15:55:32Z) - Training Dynamic based data filtering may not work for NLP datasets [0.0]
We study the applicability of the Area Under the Margin (AUM) metric to identify mislabelled examples in NLP datasets.
We find that mislabelled samples can be filtered using the AUM metric in NLP datasets but it also removes a significant number of correctly labeled points.
arXiv Detail & Related papers (2021-09-19T18:50:45Z) - Featurized Density Ratio Estimation [82.40706152910292]
In our work, we propose to leverage an invertible generative model to map the two distributions into a common feature space prior to estimation.
This featurization brings the densities closer together in latent space, sidestepping pathological scenarios where the learned density ratios in input space can be arbitrarily inaccurate.
At the same time, the invertibility of our feature map guarantees that the ratios computed in feature space are equivalent to those in input space.
arXiv Detail & Related papers (2021-07-05T18:30:26Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Improving Face Recognition by Clustering Unlabeled Faces in the Wild [77.48677160252198]
We propose a novel identity separation method based on extreme value theory.
It greatly reduces the problems caused by overlapping-identity label noise.
Experiments on both controlled and real settings demonstrate our method's consistent improvements.
arXiv Detail & Related papers (2020-07-14T12:26:50Z) - Variable Skipping for Autoregressive Range Density Estimation [84.60428050170687]
We show a technique, variable skipping, for accelerating range density estimation over deep autoregressive models.
We show that variable skipping provides 10-100$times$ efficiency improvements when targeting challenging high-quantile error metrics.
arXiv Detail & Related papers (2020-07-10T19:01:40Z) - Local intrinsic dimensionality estimators based on concentration of
measure [0.0]
Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds.
We introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds.
arXiv Detail & Related papers (2020-01-31T09:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.