Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
- URL: http://arxiv.org/abs/2505.24360v3
- Date: Thu, 10 Jul 2025 23:59:56 GMT
- Title: Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
- Authors: Stepan Shabalin, Ayush Panda, Dmitrii Kharlapenko, Abdur Raheem Ali, Yixiong Hao, Arthur Conmy
- Abstract summary: We apply Sparse Autoencoders (SAEs) and Inference-Time Decomposition of Activations (ITDA) to a text-to-image diffusion model, Flux 1. SAEs accurately reconstruct residual stream embeddings and beat MLP neurons on interpretability. We find that ITDA has comparable interpretability to SAEs.
- Score: 2.191281369664666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders are a promising new approach for decomposing language model activations for interpretation and control. They have been applied successfully to vision transformer image encoders and to small-scale diffusion models. Inference-Time Decomposition of Activations (ITDA) is a recently proposed variant of dictionary learning that takes the dictionary to be a set of data points from the activation distribution and reconstructs them with gradient pursuit. We apply Sparse Autoencoders (SAEs) and ITDA to a large text-to-image diffusion model, Flux 1, and consider the interpretability of embeddings of both by introducing a visual automated interpretation pipeline. We find that SAEs accurately reconstruct residual stream embeddings and beat MLP neurons on interpretability. We are able to use SAE features to steer image generation through activation addition. We find that ITDA has comparable interpretability to SAEs.
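The two dictionary-learning techniques named in the abstract can be made concrete with short sketches. First, a minimal TopK sparse autoencoder over residual-stream activations, together with the activation-addition steering step the abstract mentions; the dimensions, the value of k, and all helper names are illustrative assumptions, not the paper's actual Flux 1 configuration:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder for residual-stream activations.

    Hypothetical sketch: d_model, d_dict, and k are placeholders, not the
    configuration used for Flux 1 in the paper.
    """

    def __init__(self, d_model: int = 3072, d_dict: int = 65536, k: int = 64):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode, then keep only the k largest pre-activations per token.
        pre = self.encoder(x - self.b_dec)
        topk = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter_(
            -1, topk.indices, torch.relu(topk.values)
        )
        recon = self.decoder(codes) + self.b_dec  # sparse reconstruction
        return recon, codes

def steer(residual: torch.Tensor, sae: TopKSAE,
          feature_idx: int, alpha: float) -> torch.Tensor:
    """Activation addition: push the residual stream along one SAE
    decoder direction to steer image generation."""
    direction = sae.decoder.weight[:, feature_idx]  # (d_model,)
    return residual + alpha * direction / direction.norm()
```

Second, ITDA's reconstruction step. The paper uses gradient pursuit over a dictionary of stored activation data points; the sketch below substitutes plain matching pursuit, a simpler greedy pursuit, to convey the idea:

```python
def pursuit_decompose(x: torch.Tensor, dictionary: torch.Tensor, k: int = 32):
    """Greedy sparse coding over a dictionary of activation data points.

    dictionary: (n_atoms, d_model), rows assumed unit-normalized. This is
    matching pursuit, a stand-in for the gradient pursuit ITDA uses.
    """
    residual = x.clone()
    coeffs = torch.zeros(dictionary.shape[0])
    for _ in range(k):
        scores = dictionary @ residual        # correlation with residual
        idx = scores.abs().argmax()
        coeffs[idx] += scores[idx]
        residual = residual - scores[idx] * dictionary[idx]
    return coeffs, residual
```

In use, `steer` would run inside a forward hook on a transformer block during sampling; the hook placement and block choice are not specified here.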
Related papers
- TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation [34.73820805875123]
TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs) is a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97.
arXiv Detail & Related papers (2025-03-10T08:35:51Z) - One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models [26.244291553761503]
We train SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. We show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models.
arXiv Detail & Related papers (2024-10-28T19:01:18Z) - MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond [57.14128305383768]
We propose a prompt redescription strategy to realize a mirror effect between the source and reconstructed image in the diffusion model (MirrorDiffusion).
MirrorDiffusion achieves superior performance over the state-of-the-art methods on zero-shot image translation benchmarks.
arXiv Detail & Related papers (2024-01-06T14:12:16Z) - SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z) - Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit the knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal and temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linearly separable representations within its intermediate layers without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
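As a concrete reading of the linear evaluation numbers above, the protocol is: freeze the denoising network, pool one intermediate layer's activations into feature vectors, and fit a linear classifier on them. A minimal sketch, where `extract_features` is a hypothetical helper standing in for that pooling step:

```python
from sklearn.linear_model import LogisticRegression

def linear_evaluation(extract_features, train_imgs, y_train, test_imgs, y_test):
    """Fit a linear probe on frozen intermediate features; return test accuracy.

    extract_features(images) -> (N, D) feature matrix is assumed to pool
    activations from one intermediate DDAE layer (hypothetical helper).
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(extract_features(train_imgs), y_train)
    return probe.score(extract_features(test_imgs), y_test)
```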
arXiv Detail & Related papers (2023-03-17T04:20:47Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that a pre-trained diffusion model can be adapted to downstream visual perception tasks faster using the proposed VPD framework.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z) - What the DAAM: Interpreting Stable Diffusion Using Cross Attention [39.97805685586423]
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
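The aggregation DAAM describes reduces, per prompt token, to upscaling each layer's cross-attention map to a common resolution and summing. A minimal sketch under an assumed map layout (the `(heads, H, W, tokens)` shape and the normalization are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def daam_style_heatmap(attention_maps, token_idx: int, out_size=(64, 64)):
    """Aggregate cross-attention maps into one per-token heatmap.

    attention_maps: list of tensors shaped (heads, H_l, W_l, num_tokens),
    one per cross-attention layer (assumed layout for this sketch).
    """
    heatmap = torch.zeros(out_size)
    for attn in attention_maps:
        # Attention mass this layer assigns to the chosen prompt token,
        # averaged over heads.
        token_attn = attn[..., token_idx].mean(dim=0)  # (H_l, W_l)
        # Bilinearly upscale coarse layers to a common resolution.
        resized = F.interpolate(
            token_attn[None, None], size=out_size,
            mode="bilinear", align_corners=False,
        )[0, 0]
        heatmap += resized
    return heatmap / heatmap.max()  # normalize to [0, 1] for visualization
```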
arXiv Detail & Related papers (2022-10-10T17:55:41Z) - IntroVAC: Introspective Variational Classifiers for Learning Interpretable Latent Subspaces [6.574517227976925]
IntroVAC learns interpretable latent subspaces by exploiting information from an additional label.
We show that IntroVAC is able to learn meaningful directions in the latent space enabling fine manipulation of image attributes.
arXiv Detail & Related papers (2020-08-03T10:21:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.