A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models
- URL: http://arxiv.org/abs/2406.03537v2
- Date: Thu, 24 Oct 2024 18:01:03 GMT
- Title: A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models
- Authors: Hamidreza Kamkari, Brendan Leigh Ross, Rasa Hosseinzadeh, Jesse C. Cresswell, Gabriel Loaiza-Ganem
- Abstract summary: Estimating the local intrinsic dimension (LID) of a datum, i.e. the dimension of the low-dimensional submanifold it lies on, is a longstanding problem.
In this work, we show that the Fokker-Planck equation associated with a diffusion model can provide an LID estimator.
Applying FLIPD to synthetic LID estimation benchmarks, we find that DMs implemented as fully-connected networks are highly effective LID estimators.
- Score: 12.636148533844882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional data commonly lies on low-dimensional submanifolds, and estimating the local intrinsic dimension (LID) of a datum -- i.e. the dimension of the submanifold it belongs to -- is a longstanding problem. LID can be understood as the number of local factors of variation: the more factors of variation a datum has, the more complex it tends to be. Estimating this quantity has proven useful in contexts ranging from generalization in neural networks to detection of out-of-distribution data, adversarial examples, and AI-generated text. The recent successes of deep generative models present an opportunity to leverage them for LID estimation, but current methods based on generative models produce inaccurate estimates, require more than a single pre-trained model, are computationally intensive, or do not exploit the best available deep generative models: diffusion models (DMs). In this work, we show that the Fokker-Planck equation associated with a DM can provide an LID estimator which addresses the aforementioned deficiencies. Our estimator, called FLIPD, is easy to implement and compatible with all popular DMs. Applying FLIPD to synthetic LID estimation benchmarks, we find that DMs implemented as fully-connected networks are highly effective LID estimators that outperform existing baselines. We also apply FLIPD to natural images where the true LID is unknown. Despite being sensitive to the choice of network architecture, FLIPD estimates remain a useful measure of relative complexity; compared to competing estimators, FLIPD exhibits a consistently higher correlation with image PNG compression rate and better aligns with qualitative assessments of complexity. Notably, FLIPD is orders of magnitude faster than other LID estimators, and the first to be tractable at the scale of Stable Diffusion.
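To make the estimator concrete, the sketch below illustrates, under stated assumptions rather than as the authors' implementation, how an LID estimate can be read off a pretrained diffusion-model score network. For a variance-exploding diffusion process, the Fokker-Planck equation gives d log p_t(x) / d(sigma_t^2) = (1/2) [ tr(grad_x s_theta(x, t)) + ||s_theta(x, t)||^2 ], and since log p_t(x) scales like (LID(x) - D) log sigma_t near the manifold, a small-noise estimate is LID(x) ~ D + sigma_t^2 [ tr(grad_x s_theta) + ||s_theta||^2 ]. The exact FLIPD expression (including its variance-preserving form) and recommended settings are given in the paper; the names `score_fn`, `sigma`, and `n_probes` below are illustrative assumptions, and the Jacobian trace is approximated with Hutchinson's estimator to stay tractable in high dimensions.

```python
# Minimal sketch (not the authors' code) of a Fokker-Planck-based LID estimate
# from a diffusion-model score network, in a variance-exploding parameterization:
#   LID(x) ~ D + sigma^2 * ( tr(grad_x s_theta(x, sigma)) + ||s_theta(x, sigma)||^2 )
# evaluated at a small noise level sigma. `score_fn(x, sigma)` is an assumed
# interface for the trained score network; the trace is estimated with
# Hutchinson's method using Rademacher probes.
import torch

def lid_sketch(score_fn, x, sigma, n_probes=16):
    """Rough per-datum LID estimate for rows of x (shape [B, D]) at noise level sigma."""
    B, D = x.shape
    x = x.detach().clone().requires_grad_(True)
    s = score_fn(x, sigma)                     # assumed score s_theta(x, sigma), shape [B, D]
    # Hutchinson estimate of tr(grad_x s): E_v[v^T (ds/dx) v] with Rademacher v,
    # which avoids forming the full D x D Jacobian.
    trace = torch.zeros(B, device=x.device)
    for _ in range(n_probes):
        v = torch.randint(0, 2, x.shape, device=x.device).to(x.dtype) * 2 - 1
        jtv = torch.autograd.grad((s * v).sum(), x, retain_graph=True)[0]  # J^T v per row
        trace += (jtv * v).sum(dim=1)          # v^T J v, unbiased for tr(J)
    trace /= n_probes
    return D + sigma ** 2 * (trace + (s ** 2).sum(dim=1))
```

With a fully-connected score network trained on D-dimensional standardized data, calling `lid_sketch(score_fn, x, sigma=0.05)` should yield per-datum values between 0 and D; the small noise level sigma acts as the scale at which complexity is measured, and is a tunable assumption of this sketch.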
Related papers
- On Convolutions, Intrinsic Dimension, and Diffusion Models [9.220922665765153]
The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. DMs are known to be able to learn distributions with low-dimensional support.
arXiv Detail & Related papers (2025-06-25T18:00:00Z)
- Interpretable Deep Regression Models with Interval-Censored Failure Time Data [1.2993568435938014]
Deep learning methods for interval-censored data remain underexplored and limited to specific data types or models.
This work proposes a general regression framework for interval-censored data with a broad class of partially linear transformation models.
Applying our method to the Alzheimer's Disease Neuroimaging Initiative dataset yields novel insights and improved predictive performance compared to traditional approaches.
arXiv Detail & Related papers (2025-03-25T15:27:32Z)
- OOD Detection with immature Models [8.477943884416023]
Likelihood-based deep generative models (DGMs) have gained significant attention for their ability to approximate the distributions of high-dimensional data.
These models offer no guarantee of assigning higher likelihood values to in-distribution (ID) inputs, i.e. data the models are trained on, than to out-of-distribution (OOD) inputs.
In this work, we demonstrate that using immature models, stopped at early stages of training, can mostly achieve equivalent or even superior results on this downstream task.
arXiv Detail & Related papers (2025-02-02T15:14:17Z)
- Can AI be enabled to dynamical downscaling? A Latent Diffusion Model to mimic km-scale COSMO5.0_CLM9 simulations [0.0]
Downscaling techniques are one of the most prominent applications of Deep Learning (DL) in Earth System Modeling.
In this study, we apply a Latent Diffusion Model (LDM) to downscale ERA5 data over Italy up to a resolution of 2 km.
Our goal is to demonstrate that recent advancements in generative modeling enable DL to deliver results comparable to those of numerical dynamical models.
arXiv Detail & Related papers (2024-06-19T15:20:28Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Proximal Symmetric Non-negative Latent Factor Analysis: A Novel Approach to Highly-Accurate Representation of Undirected Weighted Networks [2.1797442801107056]
An Undirected Weighted Network (UWN) is commonly found in big-data-related applications.
Existing models fail to capture either its intrinsic symmetry or its low data density.
A Proximal Symmetric Non-negative Latent-factor-analysis model is proposed.
arXiv Detail & Related papers (2023-06-06T13:03:24Z)
- Hierarchical Integration Diffusion Model for Realistic Image Deblurring [71.76410266003917]
Diffusion models (DMs) have been introduced in image deblurring and exhibited promising performance.
We propose the Hierarchical Integration Diffusion Model (HI-Diff) for realistic image deblurring.
Experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T12:18:20Z)
- Fast Latent Factor Analysis via a Fuzzy PID-Incorporated Stochastic Gradient Descent Algorithm [1.984879854062214]
A stochastic gradient descent (SGD)-based latent factor analysis model is remarkably effective in extracting valuable information from a high-dimensional and incomplete (HDI) matrix.
A standard SGD algorithm learns a latent factor relying on the gradient of current instance error only without considering past update information.
This paper proposes a Fuzzy PID-incorporated SGD algorithm built on two ideas: 1) rebuilding the instance error by efficiently incorporating past update information, following the principle of a PID controller, and 2) adapting the learning hyper-parameters and gains according to fuzzy rules.
arXiv Detail & Related papers (2023-03-07T14:51:09Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply total variation distance (TVD) to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)
- Smooth densities and generative modeling with unsupervised random forests [1.433758865948252]
An important application for density estimators is synthetic data generation.
We propose a new method based on unsupervised random forests for estimating smooth densities in arbitrary dimensions without parametric constraints.
We prove the consistency of our approach and demonstrate its advantages over existing tree-based density estimators.
arXiv Detail & Related papers (2022-05-19T09:50:25Z)
- Mixed Effects Neural ODE: A Variational Approximation for Analyzing the Dynamics of Panel Data [50.23363975709122]
We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing panel data.
We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem.
We then derive Evidence Based Lower Bounds for ME-NODE and develop efficient training algorithms.
arXiv Detail & Related papers (2022-02-18T22:41:51Z)
- Unsupervised Domain Adaptation in the Dissimilarity Space for Person Re-identification [11.045405206338486]
We propose a novel Dissimilarity-based Maximum Mean Discrepancy (D-MMD) loss for aligning pair-wise distances.
Empirical results with three challenging benchmark datasets show that the proposed D-MMD loss decreases as the source and target domain distributions become more similar.
arXiv Detail & Related papers (2020-07-27T22:10:46Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.