Related papers: Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification

URL: http://arxiv.org/abs/2405.15132v1
Date: Fri, 24 May 2024 01:08:05 GMT
Title: Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification
Authors: Antonio Di Noia, Iuri Macocco, Aldo Glielmo, Alessandro Laio, Antonietta Mira
Abstract summary: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection. In almost any real-world dataset the ID depends on the scale at which the data are analysed. We introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful.
Score: 43.26660964074272
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound to the number of variables which are necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Quite typically at a small scale, the ID is very large, as the data are affected by measurement errors. At large scale, the ID can also be erroneously large, due to the curvature and the topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. This protocol is based on imposing that for distances smaller than the correct scale the density of the data is constant. Since to estimate the density it is necessary to know the ID, this condition is imposed self-consistently. We illustrate the usefulness and robustness of this procedure by benchmarks on artificial and real-world datasets.
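
The scale dependence described in the abstract can be illustrated with a small numerical sketch. This is not the paper's self-consistent protocol: it swaps in a basic two-nearest-neighbour (TwoNN-style) maximum-likelihood estimator and probes larger scales by random subsampling. The synthetic dataset (a noisy 2D plane in 5 ambient dimensions) and every name in the code (`two_nn_id`, `decimated_id`, `sigma`, and so on) are assumptions made for illustration only.

```python
import numpy as np


def two_nn_id(points):
    """TwoNN-style maximum-likelihood ID estimate: N / sum(log(r2 / r1))."""
    # Squared pairwise distances via the Gram-matrix identity.
    sq = np.sum(points**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # exclude self-distances
    # Two smallest squared distances per point (column 0 <= column 1).
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    nearest = np.maximum(nearest, np.finfo(float).tiny)  # guard against rounding
    log_mu = 0.5 * np.log(nearest[:, 1] / nearest[:, 0])  # log(r2 / r1)
    return len(points) / np.sum(log_mu)


def decimated_id(points, n_folds, rng):
    """Average the estimate over disjoint random subsamples (larger scale)."""
    folds = np.array_split(rng.permutation(len(points)), n_folds)
    return float(np.mean([two_nn_id(points[f]) for f in folds]))


# Hypothetical test data: a 2D plane embedded in 5D, plus isotropic noise.
rng = np.random.default_rng(0)
n, ambient, sigma = 2000, 5, 0.005
data = np.zeros((n, ambient))
data[:, :2] = rng.uniform(size=(n, 2))
data += sigma * rng.normal(size=(n, ambient))

id_small_scale = two_nn_id(data)              # neighbours at noise-level distances
id_large_scale = decimated_id(data, 20, rng)  # sparser samples, larger distances

print(f"ID at small scale: {id_small_scale:.2f}")
print(f"ID at larger scale: {id_large_scale:.2f}")
```

In this setup the full-sample estimate is inflated by the noise, while the subsampled estimate moves closer to the true value of 2. Choosing the subsampling level by hand is exactly the step the paper proposes to automate.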

Related papers

A Survey of Dimension Estimation Methods [0.0]
It is important to understand the real dimension of the data, hence the complexity of the dataset at hand. This survey reviews a wide range of dimension estimation methods, categorising them by the geometric information they exploit. The paper evaluates the performance of these methods, as well as investigating varying responses to curvature and noise.
arXiv Detail & Related papers (2025-07-18T13:05:42Z)
Simple and Effective Augmentation Methods for CSI Based Indoor Localization [37.3026733673066]
We propose two algorithms for channel state information based indoor localization motivated by physical considerations. As little as 10% of the original dataset size is enough to get the same performance as the original dataset. If we further augment the dataset with the proposed techniques, test accuracy is improved more than three-fold.
arXiv Detail & Related papers (2022-11-19T20:27:46Z)
Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis [0.0]
We propose a local ID estimation strategy stable even for 'tight' localities consisting of as few as 20 sample points. Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
arXiv Detail & Related papers (2022-09-29T00:00:11Z)
Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting. This suggests that evolutive pressure acts on a low-dimensional manifold despite the high-dimensionality of sequences' space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
No Shifted Augmentations (NSA): compact distributions for robust self-supervised Anomaly Detection [4.243926243206826]
Unsupervised Anomaly detection (AD) requires building a notion of normalcy, distinguishing in-distribution (ID) and out-of-distribution (OOD) data. We investigate how the geometrical compactness of the ID feature distribution makes isolating and detecting outliers easier. We propose novel architectural modifications to the self-supervised feature learning step, that enable such compact distributions for ID data to be learned.
arXiv Detail & Related papers (2022-03-19T15:55:32Z)
Featurized Density Ratio Estimation [82.40706152910292]
In our work, we propose to leverage an invertible generative model to map the two distributions into a common feature space prior to estimation. This featurization brings the densities closer together in latent space, sidestepping pathological scenarios where the learned density ratios in input space can be arbitrarily inaccurate. At the same time, the invertibility of our feature map guarantees that the ratios computed in feature space are equivalent to those in input space.
arXiv Detail & Related papers (2021-07-05T18:30:26Z)
Improving Face Recognition by Clustering Unlabeled Faces in the Wild [77.48677160252198]
We propose a novel identity separation method based on extreme value theory. It greatly reduces the problems caused by overlapping-identity label noise. Experiments on both controlled and real settings demonstrate our method's consistent improvements.
arXiv Detail & Related papers (2020-07-14T12:26:50Z)
Variable Skipping for Autoregressive Range Density Estimation [84.60428050170687]
We show a technique, variable skipping, for accelerating range density estimation over deep autoregressive models. We show that variable skipping provides 10-100$\times$ efficiency improvements when targeting challenging high-quantile error metrics.
arXiv Detail & Related papers (2020-07-10T19:01:40Z)
Uncertainty Estimation Using a Single Deep Deterministic Neural Network [66.26231423824089]
We propose a method for training a deterministic deep model that can find and reject out-of-distribution data points at test time with a single forward pass. We scale training of RBF networks with a novel loss function and centroid updating scheme and match the accuracy of softmax models.
arXiv Detail & Related papers (2020-03-04T12:27:36Z)
Local intrinsic dimensionality estimators based on concentration of measure [0.0]
Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. We introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds.
arXiv Detail & Related papers (2020-01-31T09:49:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.