Related papers: Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector

Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector

URL: http://arxiv.org/abs/2601.22468v1
Date: Fri, 30 Jan 2026 02:29:54 GMT
Title: Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector
Authors: Wenqiang Zu, Shenghao Xie, Bo Lei, Lei Ma,
Abstract summary: We introduce a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps.<n>Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis.<n>The proposed method outperforms representative guidance when applied to SiT models.
Score: 14.027059904924135
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in generative modeling has enabled high-quality visual synthesis with diffusion-based frameworks, supporting controllable sampling and large-scale training. Inference-time guidance methods such as classifier-free and representative guidance enhance semantic alignment by modifying sampling dynamics; however, they do not fully exploit unsupervised feature representations. Although such visual representations contain rich semantic structure, their integration during generation is constrained by the absence of ground-truth reference images at inference. This work reveals semantic drift in the early denoising stages of diffusion transformers, where stochasticity results in inconsistent alignment even under identical conditioning. To mitigate this issue, we introduce a guidance scheme using a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps, providing an effective semantic anchor without modifying the model architecture. Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis, achieving substantially lower FID scores; for example, REPA-XL/2 improves from 5.9 to 3.3, and the proposed method outperforms representative guidance when applied to SiT models. The approach further yields complementary gains when combined with classifier-free guidance, demonstrating enhanced semantic coherence and visual fidelity. These results establish representation-informed diffusion sampling as a practical strategy for reinforcing semantic preservation and image consistency.

Related papers

Disentangled representations via score-based variational autoencoders [21.955536401578616]
We present the Score-based Autoencoder for Multiscale Inference (SAMI)<n>SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process.<n>It can extract useful representations from pre-trained diffusion models with minimal additional training.
arXiv Detail & Related papers (2025-12-18T23:42:10Z)
Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment [13.028121107802127]
In inverse problems, pretrained generative models are employed as priors.<n>We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder.<n>We show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism.
arXiv Detail & Related papers (2025-11-21T00:37:04Z)
TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling [53.61290359948953]
Tangential Amplifying Guidance (TAG) operates solely on trajectory signals without modifying the underlying diffusion model.<n>We formalize this guidance process by leveraging a first-order Taylor expansion.<n> TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition.
arXiv Detail & Related papers (2025-10-06T06:53:29Z)
Cross-Subject Mind Decoding from Inaccurate Representations [42.19569985029642]
We propose a Bi Autoencoder Intertwining framework for accurate decoded representation prediction.<n>Our method outperforms state-of-the-art approaches on benchmark datasets in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-07-25T08:45:02Z)
Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations.<n>We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.<n>The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
Exploring Compositional Visual Generation with Latent Classifier Guidance [19.48538300223431]
We train latent diffusion models and auxiliary latent classifiers to facilitate non-linear navigation of latent representation generation. We show that such conditional generation achieved by latent classifier guidance provably maximizes a lower bound of the conditional log probability during training. We show that this paradigm based on latent classifier guidance is agnostic to pre-trained generative models, and present competitive results for both image generation and sequential manipulation of real and synthetic images.
arXiv Detail & Related papers (2023-04-25T03:02:58Z)
StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer(StraIT) is a pure non-autoregressive(NAR) generative model. Our experiments demonstrate that StraIT significantly improves NAR generation and out-performs existing DMs and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.<n>Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches.<n>We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
High-Fidelity Synthesis with Disentangled Representation [60.19657080953252]
We propose an Information-Distillation Generative Adrial Network (ID-GAN) for disentanglement learning and high-fidelity synthesis. Our method learns disentangled representation using VAE-based models, and distills the learned representation with an additional nuisance variable to the separate GAN-based generator for high-fidelity synthesis. Despite the simplicity, we show that the proposed method is highly effective, achieving comparable image generation quality to the state-of-the-art methods using the disentangled representation.
arXiv Detail & Related papers (2020-01-13T14:39:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.