Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
- URL: http://arxiv.org/abs/2602.11401v1
- Date: Wed, 11 Feb 2026 22:09:58 GMT
- Title: Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
- Authors: Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, Li Fei-Fei,
- Abstract summary: Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling.<n>We propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images.<n>Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.
- Score: 36.41177812868683
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.
Related papers
- ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion [7.233066974580282]
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution.<n>Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models.<n>We propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training.
arXiv Detail & Related papers (2025-10-29T17:17:32Z) - FARMER: Flow AutoRegressive Transformer over Pixels [39.864972164994946]
We present a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models.<n> FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model.<n>We show that FARMER achieves competitive performance compared to existing pixel-based generative models.
arXiv Detail & Related papers (2025-10-27T17:54:08Z) - DeltaDiff: Reality-Driven Diffusion with AnchorResiduals for Faithful SR [10.790771977682763]
We propose DeltaDiff, a novel frame.work that constrains the difusion process.<n>Our method surpasses state-of-the-art models and generates re-sults with better fidelity.<n>This work establishes a new low-rank constrained par-adigm for applying diffusion models to image reconstruction tasks.
arXiv Detail & Related papers (2025-02-18T06:07:14Z) - Distilling Diffusion Models into Conditional GANs [90.76040478677609]
We distill a complex multistep diffusion model into a single-step conditional GAN student model.
For efficient regression loss, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space.
We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models.
arXiv Detail & Related papers (2024-05-09T17:59:40Z) - Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves a new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
arXiv Detail & Related papers (2024-01-04T01:10:56Z) - Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional
Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z) - Gradpaint: Gradient-Guided Inpainting with Diffusion Models [71.47496445507862]
Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation.
We present GradPaint, which steers the generation towards a globally coherent image.
We generalizes well to diffusion models trained on various datasets, improving upon current state-of-the-art supervised and unsupervised methods.
arXiv Detail & Related papers (2023-09-18T09:36:24Z) - Low-Light Image Enhancement with Wavelet-based Diffusion Models [50.632343822790006]
Diffusion models have achieved promising results in image restoration tasks, yet suffer from time-consuming, excessive computational resource consumption, and unstable restoration.
We propose a robust and efficient Diffusion-based Low-Light image enhancement approach, dubbed DiffLL.
arXiv Detail & Related papers (2023-06-01T03:08:28Z) - Hierarchical Integration Diffusion Model for Realistic Image Deblurring [71.76410266003917]
Diffusion models (DMs) have been introduced in image deblurring and exhibited promising performance.
We propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring.
Experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T12:18:20Z) - Lossy Image Compression with Conditional Diffusion Models [25.158390422252097]
This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models.
In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model.
Our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics.
arXiv Detail & Related papers (2022-09-14T21:53:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.