One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2410.22366v3
- Date: Fri, 30 May 2025 14:51:51 GMT
- Title: One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models
- Authors: Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, David Bau
- Abstract summary: We train SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. We show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models.
- Score: 26.244291553761503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning on and off individual SAE features. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.
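A rough, illustrative sketch of the pipeline the abstract describes: capture a transformer block's residual-stream updates during 1-step generation, fit a sparse autoencoder on them, then edit a generation by switching an individual feature on or off. The TopK sparsity rule, dimensions, and training loop below are assumptions chosen for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumptions, not the paper's released code): a TopK sparse autoencoder
# trained on residual-stream updates captured from one SDXL Turbo transformer block,
# plus a single-feature on/off intervention in the spirit of RIEBench.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int, k: int = 32):
        super().__init__()
        self.k = k                                    # active features per token
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(self.encoder(x))
        # Enforce sparsity by keeping only the top-k activations per token.
        top = torch.topk(acts, self.k, dim=-1)
        return torch.zeros_like(acts).scatter_(-1, top.indices, top.values)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))


def train_sae(sae: SparseAutoencoder, block_updates: torch.Tensor, steps: int = 1000):
    """block_updates: (N, d_model) updates written to the residual stream by one
    transformer block, captured with forward hooks during 1-step SDXL Turbo
    generation. Full-batch training for brevity."""
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    for _ in range(steps):
        recon = sae(block_updates)
        loss = ((recon - block_updates) ** 2).mean()  # plain reconstruction objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def toggle_feature(sae: SparseAutoencoder, update: torch.Tensor,
                   feature_idx: int, scale: float) -> torch.Tensor:
    """Set one SAE feature to `scale` (0.0 turns it off) and return the modified
    block update to be written back into the U-net's residual stream."""
    acts = sae.encode(update)
    acts[..., feature_idx] = scale
    return sae.decoder(acts)
```

In the paper's setup, SAEs are trained on the updates of individual transformer blocks and RIEBench tracks which blocks' features matter for each edit category; the toggle above is meant to mirror that kind of single-feature intervention.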
Related papers
- Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning [2.191281369664666]
We apply Sparse Autoencoders (SAEs) and Inference-Time Decomposition of Activations (ITDA) to a text-to-image diffusion model, Flux 1. SAEs accurately reconstruct residual stream embeddings and beat neurons on interpretability. We find that ITDA has comparable interpretability to SAEs.
arXiv Detail & Related papers (2025-05-30T08:53:27Z) - Decoder-Only LLMs are Better Controllers for Diffusion Models [63.22040456010123]
We propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models. Our adapter module is superior to state-of-the-art models in terms of text-to-image generation quality and reliability.
arXiv Detail & Related papers (2025-02-06T12:17:35Z) - HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE.
It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts.
Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z) - Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature.
The recent success of large language models (LLMs) showcases the power of decoder-only transformers.
This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Interpreting Attention Layer Outputs with Sparse Autoencoders [3.201633659481912]
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability.
In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition.
We show that Sparse Autoencoders are a useful tool that enable researchers to explain model behavior in greater detail than prior work.
arXiv Detail & Related papers (2024-06-25T17:43:13Z) - Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models [42.891427362223176]
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities.
We propose a novel framework to fully harness the capabilities of LLMs.
We further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework.
arXiv Detail & Related papers (2024-06-17T17:59:43Z) - Compositional Text-to-Image Generation with Dense Blob Representations [48.1976291999674]
Existing text-to-image models struggle to follow complex text prompts.
We develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation.
Our experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO.
arXiv Detail & Related papers (2024-05-14T00:22:06Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD).
SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image.
Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
arXiv Detail & Related papers (2023-11-27T18:56:37Z) - De-Diffusion Makes Text a Strong Cross-Modal Interface [33.90004746543745]
We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
Experiments validate the precision and comprehensiveness of De-Diffusion text representing images.
A single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools.
arXiv Detail & Related papers (2023-11-01T16:12:40Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that pre-trained diffusion models can be adapted to downstream visual perception tasks faster using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - What the DAAM: Interpreting Stable Diffusion Using Cross Attention [39.97805685586423]
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
arXiv Detail & Related papers (2022-10-10T17:55:41Z)
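As a companion to the DAAM entry above, here is a rough sketch of the kind of cross-attention aggregation it describes: upsample each layer's cross-attention maps to a common resolution and accumulate them into per-token heatmaps. The tensor shapes, bicubic upsampling, and head averaging are assumptions, not the DAAM release.

```python
# Rough sketch (assumed shapes, not the DAAM release): turn cross-attention maps from a
# latent denoising U-net into per-token heatmaps by upsampling and accumulating them.
import torch
import torch.nn.functional as F


def aggregate_attention_maps(attn_maps, out_size=(64, 64)):
    """attn_maps: iterable of tensors shaped (heads, h*w, n_tokens), one per
    cross-attention layer and denoising step, captured with forward hooks.
    Returns a (n_tokens, H, W) heatmap stack."""
    heat = None
    for maps in attn_maps:
        _, hw, n_tokens = maps.shape
        side = int(hw ** 0.5)                       # latent feature maps are square
        grid = maps.mean(dim=0)                     # average over heads -> (h*w, n_tokens)
        grid = grid.T.reshape(n_tokens, 1, side, side)
        grid = F.interpolate(grid, size=out_size, mode="bicubic", align_corners=False)
        heat = grid.squeeze(1) if heat is None else heat + grid.squeeze(1)
    return heat
```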
This list is automatically generated from the titles and abstracts of the papers on this site.