Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies
- URL: http://arxiv.org/abs/2504.07008v1
- Date: Wed, 09 Apr 2025 16:26:26 GMT
- Title: Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies
- Authors: Jonas Loos, Lorenz Linhardt
- Abstract summary: We analyze popular Stable Diffusion models using representational similarity and norms. Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts.
- Score: 2.1261727383260043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have demonstrated remarkable capabilities in synthesizing realistic images, spurring interest in using their representations for various downstream tasks. To better understand the robustness of these representations, we analyze popular Stable Diffusion models using representational similarity and norms. Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. These findings underscore the need to further investigate the properties of diffusion model representations before considering them for downstream tasks that require robust features. Project page: https://jonasloos.github.io/sd-representation-anomalies
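The two quantities the abstract names, per-position norms and representational similarity, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single feature map of shape (C, H, W) and uses cosine similarity as the similarity measure. The random array stands in for a real U-Net activation, and the helper names `token_norms` and `cosine_similarity_map` are hypothetical.

```python
import numpy as np


def token_norms(features):
    """Return the L2 norm of the feature vector at each spatial position.

    features: array of shape (C, H, W). Result has shape (H, W).
    """
    return np.linalg.norm(features, axis=0)


def cosine_similarity_map(features, row, col):
    """Cosine similarity between the token at (row, col) and every token.

    features: array of shape (C, H, W). Result has shape (H, W).
    """
    c, h, w = features.shape
    tokens = features.reshape(c, h * w).T  # (H*W, C), one row per position
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    reference = unit[row * w + col]
    return (unit @ reference).reshape(h, w)


# Stand-in data (assumption): random features with one artificially
# scaled token to mimic a high-norm anomaly at the top-left corner.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))
features[:, 0, 0] *= 50.0

norms = token_norms(features)
similarity = cosine_similarity_map(features, 0, 0)
```

A position whose norm is far above the rest of the map would be a candidate high-norm artifact, and a similarity map that stays high between distant corners would hint at the corner artifacts described above; the thresholds for either are left open here.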
Related papers
- Diffusion Counterfactuals for Image Regressors [1.534667887016089]
We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set. We find that for regression counterfactuals, changes in features depend on the region of the predicted value.
arXiv Detail & Related papers (2025-03-26T14:42:46Z)
- Comparative Analysis of Generative Models: Enhancing Image Synthesis with VAEs, GANs, and Stable Diffusion [0.0]
This paper examines three major generative modelling frameworks: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs) and Stable Diffusion models.
arXiv Detail & Related papers (2024-08-16T13:50:50Z)
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51\%$, which exceeds the second place by $+2.21\%$.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- An Analysis of Human Alignment of Latent Diffusion Models [4.301861805545143]
Diffusion models, trained on large amounts of data, showed remarkable performance for image synthesis.
They have high error consistency with humans and low texture bias when used for classification.
We analyze how well such representations are aligned to human responses on a triplet odd-one-out task.
arXiv Detail & Related papers (2024-03-13T12:31:08Z)
- Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement [58.9768112704998]
Disentangled representation learning strives to extract the intrinsic factors within observed data.
We introduce a new perspective and framework, demonstrating that diffusion models with cross-attention can serve as a powerful inductive bias.
This is the first work to reveal the potent disentanglement capability of diffusion models with cross-attention, requiring no complex designs.
arXiv Detail & Related papers (2024-02-15T05:07:54Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Exploiting Diffusion Prior for Generalizable Dense Prediction [85.4563592053464]
Recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate.
We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks.
Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
arXiv Detail & Related papers (2023-11-30T18:59:44Z)
- A General Protocol to Probe Large Vision Models for 3D Physical Understanding [84.54972153436466]
We introduce a general protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical 'properties' of the 3D scene.
We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures.
We find that features from Stable Diffusion and DINOv2 are good for discriminative learning of a number of properties.
arXiv Detail & Related papers (2023-10-10T17:59:28Z)
- Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model [52.634378583311054]
Latent diffusion models (LDMs) produce realistic images, yet the inner workings of these models remain mysterious.
In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry?
Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction.
arXiv Detail & Related papers (2023-06-09T07:34:34Z)
- A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence [83.90531416914884]
We exploit Stable Diffusion features for semantic and dense correspondence.
With simple post-processing, SD features can perform quantitatively similarly to SOTA representations.
We show that these correspondences can enable interesting applications such as instance swapping in two images.
arXiv Detail & Related papers (2023-05-24T16:59:26Z)
- Intriguing properties of synthetic images: from generative adversarial networks to diffusion models [19.448196464632]
It is important to gain insight into which image features better discriminate fake images from real ones.
In this paper we report on our systematic study of a large number of image generators of different families, aimed at discovering the most forensically relevant characteristics of real and generated images.
arXiv Detail & Related papers (2023-04-13T11:13:19Z)
- CRADL: Contrastive Representations for Unsupervised Anomaly Detection and Localization [2.8659934481869715]
Unsupervised anomaly detection in medical imaging aims to detect and localize arbitrary anomalies without requiring anomalous data during training.
Most current state-of-the-art methods use latent variable generative models operating directly on the images.
We propose CRADL whose core idea is to model the distribution of normal samples directly in the low-dimensional representation space of an encoder trained with a contrastive pretext-task.
arXiv Detail & Related papers (2023-01-05T16:07:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.