An Analysis of Human Alignment of Latent Diffusion Models
- URL: http://arxiv.org/abs/2403.08469v1
- Date: Wed, 13 Mar 2024 12:31:08 GMT
- Title: An Analysis of Human Alignment of Latent Diffusion Models
- Authors: Lorenz Linhardt and Marco Morik and Sidney Bender and Naima Elosegui Borras
- Abstract summary: Diffusion models, trained on large amounts of data, showed remarkable performance for image synthesis.
They have high error consistency with humans and low texture bias when used for classification.
We analyze how well such representations are aligned to human responses on a triplet odd-one-out task.
- Score: 4.301861805545143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models, trained on large amounts of data, showed remarkable
performance for image synthesis. They have high error consistency with humans
and low texture bias when used for classification. Furthermore, prior work
demonstrated the decomposability of their bottleneck layer representations into
semantic directions. In this work, we analyze how well such representations are
aligned to human responses on a triplet odd-one-out task. We find that despite
the aforementioned observations: I) The representational alignment with humans
is comparable to that of models trained only on ImageNet-1k. II) The most
aligned layers of the denoiser U-Net are intermediate layers and not the
bottleneck. III) Text conditioning greatly improves alignment at high noise
levels, hinting at the importance of abstract textual information, especially
in the early stage of generation.
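The triplet odd-one-out comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each image has a fixed-length representation vector, that cosine similarity is the similarity measure, and that the model's odd-one-out is the item left over once the most similar pair is found. The function names, the toy vectors, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_odd_one_out(embeddings, triplet):
    """Return the position (0, 1, or 2) of the predicted odd-one-out.

    Assumption: the odd item is the one excluded from the most similar pair.
    """
    i, j, k = triplet
    # Similarity of the pair that remains when each position is left out.
    remaining_pair_sim = [
        cosine(embeddings[j], embeddings[k]),  # leave out position 0
        cosine(embeddings[i], embeddings[k]),  # leave out position 1
        cosine(embeddings[i], embeddings[j]),  # leave out position 2
    ]
    return max(range(3), key=lambda p: remaining_pair_sim[p])


def odd_one_out_accuracy(embeddings, triplets, human_choices):
    """Fraction of triplets where the prediction matches the human choice."""
    hits = sum(
        predict_odd_one_out(embeddings, t) == c
        for t, c in zip(triplets, human_choices)
    )
    return hits / len(triplets)


# Toy data (hypothetical): items 0 and 1 are similar, items 2 and 3 are similar.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_triplets = [(0, 1, 2), (0, 2, 3)]
toy_human_choices = [2, 0]  # position of the item humans picked as odd
toy_accuracy = odd_one_out_accuracy(toy_embeddings, toy_triplets, toy_human_choices)
```

Under these assumptions, a higher accuracy means the representation's similarity structure agrees more often with human judgments. Repeating the computation on representations taken from different layers or noise levels would give a layer-wise comparison of the kind the abstract describes.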
Related papers
- Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies [2.1261727383260043]
We analyze popular Stable Diffusion models using representational similarity and norms.
Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts.
arXiv Detail & Related papers (2025-04-09T16:26:26Z)
- LEGION: Learning to Ground and Explain for Synthetic Image Detection [49.958951540410816]
We introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations.
It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels.
We propose LEGION, a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation.
arXiv Detail & Related papers (2025-03-19T14:37:21Z)
- Generalized Pose Space Embeddings for Training In-the-Wild using Analysis-by-Synthesis [0.0]
We develop a more expressive intermediate skeleton representation capable of capturing the semantics of the pose.
We extend the analysis-by-synthesis framework with a training protocol based on synthetic data.
Our approach outperforms previous models trained with analysis-by-synthesis on standard benchmarks.
arXiv Detail & Related papers (2024-11-13T13:40:27Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion [35.60459492849359]
We study the problem of generating intermediate images from image pairs with large motion.
Due to the large motion, the intermediate semantic information may be absent in input images.
We propose DreamMover, a novel image interpolation framework with three main components.
arXiv Detail & Related papers (2024-09-15T04:09:12Z)
- SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation [69.42764583465508]
We explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks.
To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation.
arXiv Detail & Related papers (2024-03-25T10:30:22Z)
- Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation [59.138470433237615]
We introduce statistical metrics that quantify both the linguistic and visual skew of a dataset for relational learning.
We show that systematically controlled metrics are strongly predictive of generalization performance.
This work points to an important direction: improving the diversity or balance of the data, rather than only scaling up its absolute size.
arXiv Detail & Related papers (2024-03-25T03:18:39Z)
- CRADL: Contrastive Representations for Unsupervised Anomaly Detection and Localization [2.8659934481869715]
Unsupervised anomaly detection in medical imaging aims to detect and localize arbitrary anomalies without requiring anomalous data during training.
Most current state-of-the-art methods use latent variable generative models operating directly on the images.
We propose CRADL whose core idea is to model the distribution of normal samples directly in the low-dimensional representation space of an encoder trained with a contrastive pretext-task.
arXiv Detail & Related papers (2023-01-05T16:07:49Z)
- What the DAAM: Interpreting Stable Diffusion Using Cross Attention [39.97805685586423]
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
arXiv Detail & Related papers (2022-10-10T17:55:41Z)
- Counterfactual Generative Networks [59.080843365828756]
We propose to decompose the image generation process into independent causal mechanisms that we train without direct supervision.
By exploiting appropriate inductive biases, these mechanisms disentangle object shape, object texture, and background.
We show that the counterfactual images can improve out-of-distribution robustness with a marginal drop in performance on the original classification task.
arXiv Detail & Related papers (2021-01-15T10:23:12Z)
- Learning Compositional Neural Information Fusion for Human Parsing [181.48380078517525]
We formulate the approach as a neural information fusion framework.
Our model assembles the information from three inference processes over the hierarchy.
The whole model is end-to-end differentiable, explicitly modeling information flows and structures.
arXiv Detail & Related papers (2020-01-19T10:35:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.