Intrinsic Dimensionality Estimation within Tight Localities: A
Theoretical and Experimental Analysis
- URL: http://arxiv.org/abs/2209.14475v1
- Date: Thu, 29 Sep 2022 00:00:11 GMT
- Title: Intrinsic Dimensionality Estimation within Tight Localities: A
Theoretical and Experimental Analysis
- Authors: Laurent Amsaleg (CNRS-IRISA, France), Oussama Chelly (Amazon Web
Services, Munich, Germany), Michael E. Houle (The University of Melbourne,
Australia), Ken-ichi Kawarabayashi (National Institute of Informatics,
Japan), Miloš Radovanović (University of Novi Sad, Serbia), Weeris
Treeratanajaru (Bank of Thailand)
- Abstract summary: We propose a local ID estimation strategy that is stable even for 'tight' localities consisting of as few as 20 sample points.
Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance
in many data mining and machine learning tasks, including dimensionality
reduction, outlier detection, similarity search and subspace clustering.
However, since their convergence generally requires sample sizes (that is,
neighborhood sizes) on the order of hundreds of points, existing ID estimation
methods may have only limited usefulness for applications in which the data
consists of many natural groups of small size. In this paper, we propose a
local ID estimation strategy stable even for 'tight' localities consisting of
as few as 20 sample points. The estimator applies MLE techniques over all
available pairwise distances among the members of the sample, based on a recent
extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic
Dimension (LID). Our experimental results show that our proposed estimation
technique can achieve notably smaller variance, while maintaining comparable
levels of bias, at much smaller sample sizes than state-of-the-art estimators.
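As a rough illustration of the idea (not the authors' implementation; the function names, the normalization by the largest distance, and the test data are assumptions), the following Python sketch applies the standard maximum-likelihood (Hill-type) LID estimator to all pairwise distances within a small sample rather than to distances from a single query point:

    import numpy as np
    from scipy.spatial.distance import pdist

    def mle_lid(distances):
        # Hill-type maximum-likelihood LID estimate from positive distances,
        # normalized by the largest distance in the sample.
        d = np.sort(np.asarray(distances, dtype=float))
        d = d[d > 0]                   # zero distances carry no information
        return -1.0 / np.mean(np.log(d[:-1] / d[-1]))

    def tight_locality_lid(points):
        # 'Tight locality' variant: pool ALL pairwise distances among the
        # sample members instead of distances to one central point.
        return mle_lid(pdist(points))  # condensed pairwise-distance vector

    # Usage: 20 points from a 5-dimensional subspace embedded in 50 dimensions.
    rng = np.random.default_rng(0)
    sample = rng.normal(size=(20, 5)) @ rng.normal(size=(5, 50))
    print(tight_locality_lid(sample))  # roughly 5, despite only 20 points

Pooling the 190 pairwise distances of a 20-point sample, rather than only the 19 distances to one center, is what drives the variance reduction claimed above.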
Related papers
- Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
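A minimal sketch of this pipeline, using the third-party minisom package with an assumed grid size, training schedule, and label-propagation rule (none of which come from the paper):

    import numpy as np
    from minisom import MiniSom  # pip install minisom

    rng = np.random.default_rng(0)
    X_unlabeled = rng.normal(size=(500, 4))        # plentiful unlabeled data
    X_labeled = rng.normal(size=(10, 4))           # minimal labeled data
    y_labeled = (X_labeled[:, 0] > 0).astype(int)  # toy labels

    # Step 1: train the SOM on unlabeled data only.
    som = MiniSom(8, 8, input_len=4, sigma=1.5, learning_rate=0.5, random_seed=0)
    som.train_random(X_unlabeled, num_iteration=2000)

    # Step 2: assign each labeled point's label to its best matching unit (BMU).
    bmu_labels = {som.winner(x): y for x, y in zip(X_labeled, y_labeled)}

    # Step 3: label a new point via the nearest labeled BMU on the grid.
    def predict(x):
        i, j = som.winner(x)
        return min(bmu_labels.items(),
                   key=lambda kv: (kv[0][0] - i) ** 2 + (kv[0][1] - j) ** 2)[1]

    print(predict(np.array([1.0, 0.0, 0.0, 0.0])))  # most likely class 1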
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Dimensionality-Aware Outlier Detection: Theoretical and Experimental Analysis [9.962838991341874]
We present a nonparametric method for outlier detection that takes full account of local variations in dimensionality within the dataset.
We show that it significantly outperforms three popular and important benchmark outlier detection methods.
arXiv Detail & Related papers (2024-01-10T01:07:35Z)
- Robust Bayesian Subspace Identification for Small Data Sets [91.3755431537592]
We propose regularized estimators, shrinkage estimators and Bayesian estimation to reduce the effect of variance.
Our experimental results show that our proposed estimators may reduce the estimation risk to as little as 40% of that of traditional subspace methods.
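The summary does not spell the estimators out; as a generic illustration of why shrinkage reduces variance in the small-sample regime, the sketch below uses scikit-learn's Ledoit-Wolf estimator as an assumed stand-in, not the paper's method:

    import numpy as np
    from sklearn.covariance import LedoitWolf, empirical_covariance

    rng = np.random.default_rng(0)
    dim, n = 30, 25                    # fewer samples than dimensions
    true_cov = np.eye(dim)
    X = rng.multivariate_normal(np.zeros(dim), true_cov, size=n)

    emp = empirical_covariance(X)      # high-variance plug-in estimate
    lw = LedoitWolf().fit(X)           # shrinks toward a scaled identity

    # Frobenius error against the known truth: shrinkage typically wins here.
    print("empirical  :", np.linalg.norm(emp - true_cov))
    print("Ledoit-Wolf:", np.linalg.norm(lw.covariance_ - true_cov))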
arXiv Detail & Related papers (2022-12-29T00:29:04Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- Supervised Multivariate Learning with Simultaneous Feature Auto-grouping and Dimension Reduction [7.093830786026851]
This paper proposes a novel clustered reduced-rank learning framework.
It imposes two joint matrix regularizations to automatically group the features in constructing predictive factors.
It is more interpretable than low-rank modeling and relaxes the stringent sparsity assumption in variable selection.
arXiv Detail & Related papers (2021-12-17T20:11:20Z)
- Local Intrinsic Dimensionality Signals Adversarial Perturbations [28.328973408891834]
Local intrinsic dimensionality (LID) is a local metric that describes the minimum number of latent variables required to describe each data point.
In this paper, we derive a lower bound and an upper bound for the LID value of a perturbed data point, and demonstrate that the bounds, in particular the lower bound, have a positive correlation with the magnitude of the perturbation.
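A toy demonstration of the effect, using the standard maximum-likelihood LID estimator as an assumed stand-in and synthetic data rather than the paper's setting:

    import numpy as np

    def mle_lid_at(query, data, k=20):
        # Standard MLE/Hill LID estimate at `query` from its k nearest neighbors.
        r = np.sort(np.linalg.norm(data - query, axis=1))
        r = r[r > 1e-12][:k]           # drop the query itself if present
        return -1.0 / np.mean(np.log(r[:-1] / r[-1]))

    rng = np.random.default_rng(0)
    # Points near a 2-dimensional plane embedded in 20-dimensional space.
    data = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 20))

    for eps in (0.0, 0.5, 2.0):
        x_adv = data[0] + eps * rng.normal(size=20)  # off-manifold perturbation
        print(f"eps={eps:.1f}  LID~{mle_lid_at(x_adv, data):.2f}")

    # The LID estimate tends to grow with eps: the farther the point is pushed
    # off the manifold, the more uniform its neighbor distances become.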
arXiv Detail & Related papers (2021-09-24T08:29:50Z)
- Featurized Density Ratio Estimation [82.40706152910292]
In our work, we propose to leverage an invertible generative model to map the two distributions into a common feature space prior to estimation.
This featurization brings the densities closer together in latent space, sidestepping pathological scenarios where the learned density ratios in input space can be arbitrarily inaccurate.
At the same time, the invertibility of our feature map guarantees that the ratios computed in feature space are equivalent to those in input space.
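The equivalence is the change-of-variables formula at work: for an invertible map f with Jacobian J_f, the Jacobian determinant appears in both pushforward densities and cancels in the ratio (a standard identity, not specific to this paper):

    \frac{p(x)}{q(x)}
      = \frac{p_z(f(x))\,|\det J_f(x)|}{q_z(f(x))\,|\det J_f(x)|}
      = \frac{p_z(f(x))}{q_z(f(x))}

where p_z and q_z are the densities of the two distributions pushed through f.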
arXiv Detail & Related papers (2021-07-05T18:30:26Z)
- Meta-Learning for Relative Density-Ratio Estimation [59.75321498170363]
Existing methods for (relative) density-ratio estimation (DRE) require many instances from both densities.
We propose a meta-learning method for relative DRE, which estimates the relative density-ratio from a few instances by using knowledge in related datasets.
We empirically demonstrate the effectiveness of the proposed method by using three problems: relative DRE, dataset comparison, and outlier detection.
arXiv Detail & Related papers (2021-07-02T02:13:45Z)
- Intrinsic Dimension Estimation [92.87600241234344]
We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees.
We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending on the intrinsic dimension of the data.
arXiv Detail & Related papers (2021-06-08T00:05:39Z)
- Local intrinsic dimensionality estimators based on concentration of measure [0.0]
Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds.
We introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds.
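The summary names the mechanism but not the formulas; the sketch below only demonstrates the underlying concentration-of-measure effect (the Fisher-separability criterion and sampling scheme are assumptions, and the paper's estimators invert relationships of this kind):

    import numpy as np

    def fisher_separable_fraction(X, alpha=0.8):
        # Fraction of points x separable from every other point y in the
        # Fisher sense: <x, y> < alpha * <x, x> for all y != x.
        G = X @ X.T
        ok = G < alpha * np.diag(G)[:, None]
        np.fill_diagonal(ok, True)     # a point need not be separated from itself
        return ok.all(axis=1).mean()

    rng = np.random.default_rng(0)
    for d in (2, 5, 10, 20, 50):
        # 500 points sampled uniformly from the d-dimensional unit ball
        X = rng.normal(size=(500, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        X *= rng.uniform(size=(500, 1)) ** (1.0 / d)
        print(d, fisher_separable_fraction(X))

    # The separable fraction rises sharply with dimension, which is exactly
    # the kind of signal such local ID estimators read off.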
arXiv Detail & Related papers (2020-01-31T09:49:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.