Test-Time Conditioning with Representation-Aligned Visual Features
- URL: http://arxiv.org/abs/2602.03753v1
- Date: Tue, 03 Feb 2026 17:15:03 GMT
- Title: Test-Time Conditioning with Representation-Aligned Visual Features
- Authors: Nicolas Sereyjol-Garros, Ellington Kirby, Victor Letzelter, Victor Besnier, Nermin Samet,
- Abstract summary: We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages aligned representations with rich semantic properties. We steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance.
- Score: 9.262325724962485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with rich semantic properties, to enable test-time conditioning from features in generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.
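The abstract describes guidance as gradient ascent on a similarity potential during denoising: at each step, the sample is nudged toward higher feature similarity with a conditioning representation before the denoiser is applied. The following is a minimal NumPy sketch of that idea only, not the paper's implementation: `extract` and `denoise` are hypothetical stand-ins for a pre-trained feature extractor and one step of a diffusion sampler, and a finite-difference gradient replaces backpropagation through a real network.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def feature_guided_step(x, target_feat, extract, denoise,
                        guidance_scale=0.5, eps=1e-4):
    """One guided update: climb the similarity potential, then denoise.

    The gradient of the potential is estimated by finite differences;
    a real implementation would use autograd through the extractor.
    """
    base = cosine_sim(extract(x), target_feat)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp.flat[i] += eps
        grad.flat[i] = (cosine_sim(extract(xp), target_feat) - base) / eps
    return denoise(x + guidance_scale * grad)

# Toy demo: a fixed random linear map stands in for the feature
# extractor, and the "denoiser" is the identity, which isolates the
# effect of the guidance term alone.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
extract = lambda x: W @ x
denoise = lambda x: x

target = extract(rng.standard_normal(4))   # conditioning feature
x = rng.standard_normal(4)                 # current sample

before = cosine_sim(extract(x), target)
for _ in range(50):
    x = feature_guided_step(x, target, extract, denoise)
after = cosine_sim(extract(x), target)
```

Repeated guided steps increase the sample's feature similarity to the target, which is the tilting effect the paper formalizes via the potential-induced distribution; global-token versus single-patch conditioning would differ only in which features `extract` returns.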
Related papers
- Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector [14.027059904924135]
We introduce a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps. Experiments on SiT and REPA models show notable improvements in class-conditional ImageNet synthesis. The proposed method outperforms representative guidance when applied to SiT models.
arXiv Detail & Related papers (2026-01-30T02:29:54Z) - GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features. A residual learning module is proposed to recover fine-grained details and minimize the reconstruction error caused by quantization. Experiments on the standard ImageNet-1k benchmark show that the proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z) - Unsupervised Representation Learning by Balanced Self Attention Matching [2.3020018305241337]
We present a self-supervised method for embedding image features called BAM.
We obtain rich representations and avoid feature collapse by minimizing a loss that matches these distributions to their globally balanced and entropy regularized version.
We show competitive performance with leading methods on both semi-supervised and transfer-learning benchmarks.
arXiv Detail & Related papers (2024-08-04T12:52:44Z) - Readout Guidance: Learning Control from Diffusion Features [96.22155562120231]
We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals.
Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep.
These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity.
arXiv Detail & Related papers (2023-12-04T18:59:32Z) - DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation [14.725538019917625]
Diffusion probabilistic models (DPMs) have shown remarkable results on various image synthesis tasks.
DPMs lack a low-dimensional, interpretable, and well-decoupled latent code.
We propose Diff-AE to explore the potential of DPMs for representation learning via autoencoding.
arXiv Detail & Related papers (2023-07-12T04:11:08Z) - End-to-End Diffusion Latent Optimization Improves Classifier Guidance [81.27364542975235]
Direct Optimization of Diffusion Latents (DOODL) is a novel guidance method.
It enables plug-and-play guidance by optimizing diffusion latents.
It outperforms one-step classifier guidance on computational and human evaluation metrics.
arXiv Detail & Related papers (2023-03-23T22:43:52Z) - Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework.
Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z) - Unsupervised Discovery of Disentangled Manifolds in GANs [74.24771216154105]
Interpretable generation process is beneficial to various image editing applications.
We propose a framework to discover interpretable directions in the latent space given arbitrary pre-trained generative adversarial networks.
arXiv Detail & Related papers (2020-11-24T02:18:08Z) - Generalized Adversarially Learned Inference [42.40405470084505]
We develop methods of inference of latent variables in GANs by adversarially training an image generator along with an encoder to match two joint distributions of image and latent vector pairs.
We incorporate multiple layers of feedback on reconstructions, self-supervision, and other forms of supervision based on prior or learned knowledge about the desired solutions.
arXiv Detail & Related papers (2020-06-15T02:18:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.