Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
- URL: http://arxiv.org/abs/2511.14716v1
- Date: Tue, 18 Nov 2025 17:58:16 GMT
- Title: Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
- Authors: Xiyuan Wang, Muhan Zhang
- Abstract summary: Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network. We propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion.
- Score: 53.77953728335891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID = 13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs, without classifier-free guidance.
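The abstract does not spell out the exact objective, but its self-distillation analogy suggests the general shape of a training step. The sketch below is a minimal, assumption-laden illustration in PyTorch: a single toy network exposes encoder, decoder, and denoiser heads, and the denoising target is a stop-gradient copy of the latent, a standard self-distillation trick against representation collapse. All names, dimensions, and loss weights here are hypothetical, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneNetLDM(nn.Module):
    """Toy single network that encodes, decodes, and denoises with one set
    of weights. Purely illustrative; the real DSD architecture is not given
    in the abstract."""
    def __init__(self, x_dim=784, z_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.GELU(), nn.Linear(256, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.GELU(), nn.Linear(256, x_dim))
        self.den = nn.Sequential(nn.Linear(z_dim + 1, 256), nn.GELU(), nn.Linear(256, z_dim))

    def denoise(self, z_t, t):
        # Condition the denoiser head on the noise level t.
        return self.den(torch.cat([z_t, t[:, None]], dim=1))

def dsd_step(model, x, opt):
    z = model.enc(x)                                   # latent from the shared trunk
    t = torch.rand(x.size(0), device=x.device)         # random noise level per sample
    z_t = (1 - t[:, None]) * z.detach() + t[:, None] * torch.randn_like(z)
    # Assumed anti-collapse fix, in the spirit of self-distillation: the
    # denoising target is a stop-gradient copy of the latent, so the diffusion
    # loss cannot shrink the latent space to make its own prediction easier.
    loss = F.mse_loss(model.denoise(z_t, t), z.detach()) \
         + F.mse_loss(model.dec(z), x)                 # reconstruction keeps z informative
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

A training loop would simply call `dsd_step(model, batch, opt)`; the point of the sketch is that one optimizer updates a single set of weights covering all three roles.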
Related papers
- Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models [7.17300076441681]
SurgUn is a surgical unlearning method that applies targeted weight-space updates to remove specific visual concepts in text-conditioned diffusion models. Our approach is motivated by retroactive interference theory, which holds that newly acquired memories can overwrite, suppress, or impede access to prior ones. We adapt this principle to diffusion models by inducing retroactive concept interference, enabling focused destabilization of only the target concept.
arXiv Detail & Related papers (2026-03-01T08:07:14Z) - MeanFlow Transformers with Representation Autoencoders [71.45823902973349]
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE). We achieve a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
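MeanFlow's "long jump" comes from modeling the average velocity between two times, which yields the displacement identity $z_r = z_t - (t - r)\,u(z_t, r, t)$. A sketch of the resulting few-step sampler follows; the signature of `u` and the unconditional setup are assumptions for brevity (the real model is also class-conditioned).

```python
import torch

@torch.no_grad()
def meanflow_sample(u, shape, steps=1, device="cpu"):
    """Few-step sampling with an average-velocity model u(z, r, t), per the
    MeanFlow identity z_r = z_t - (t - r) * u(z_t, r, t). With steps=1 this
    is a single long jump from noise (t=1) to data (t=0)."""
    z = torch.randn(shape, device=device)              # z at t = 1 (pure noise)
    ts = torch.linspace(1.0, 0.0, steps + 1).tolist()
    for t, r in zip(ts[:-1], ts[1:]):
        t_b = torch.full((shape[0],), t, device=device)
        r_b = torch.full((shape[0],), r, device=device)
        z = z - (t - r) * u(z, r_b, t_b)               # jump from time t to time r
    return z
```

With `steps=1` this single network evaluation is what sits behind the reported 1-step FID.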
arXiv Detail & Related papers (2025-11-17T06:17:08Z) - REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers [52.55041244336767]
Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. For latent diffusion transformers, however, end-to-end training of both the VAE and the diffusion model with the standard diffusion loss is observed to be ineffective. We show that while the diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss.
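REPA's key ingredient is an alignment loss between the diffusion network's intermediate features and those of a frozen pretrained visual encoder. Below is a minimal sketch of such a loss, assuming token-aligned features and a trainable projection head; layer choice and loss weighting are the paper's details, not shown here.

```python
import torch
import torch.nn.functional as F

def repa_loss(diff_feats, teacher_feats, proj):
    """REPA-style alignment sketch: project the diffusion network's
    intermediate tokens and pull them toward features of a frozen
    pretrained encoder (e.g. a self-supervised ViT).
    diff_feats:    (B, N, D_diff) tokens from a diffusion transformer block
    teacher_feats: (B, N, D_teach) tokens from the frozen encoder
    proj:          trainable MLP mapping D_diff -> D_teach
    """
    h = proj(diff_feats)
    # Negative cosine similarity, averaged over batch and tokens.
    return -F.cosine_similarity(h, teacher_feats.detach(), dim=-1).mean()
```

Per the abstract, it is this alignment term, added alongside the diffusion loss, that makes joint VAE-plus-diffusion training stable where the plain diffusion loss fails.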
arXiv Detail & Related papers (2025-04-14T17:59:53Z) - CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models [77.39903417768967]
CatVTON is a virtual try-on diffusion model that transfers in-shop or worn garments of arbitrary categories to target individuals. CatVTON consists only of a VAE and a simplified denoising UNet, removing redundant image and text encoders. Experiments demonstrate that CatVTON achieves superior qualitative and quantitative results compared to baseline methods.
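The claim here is architectural: conditioning is achieved by concatenation rather than by dedicated encoders. A minimal sketch of what that input packing could look like; the tensor layout and the width-wise axis are both assumptions, as the abstract does not fix them.

```python
import torch

def catvton_unet_input(person_latent, garment_latent):
    """Sketch of CatVTON-style conditioning: no garment encoder, no text
    encoder -- just place the garment latent next to the person latent
    along a spatial axis before the denoising UNet.
    Assumes (B, C, H, W) layout; the real model's exact packing may differ."""
    return torch.cat([person_latent, garment_latent], dim=-1)  # side by side in width
```

The appeal is parameter count: nothing but the VAE and the UNet remain in the pipeline.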
arXiv Detail & Related papers (2024-07-21T11:58:53Z) - Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z) - Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linearly separable representations within its intermediate layers without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
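Linear evaluation here means freezing the denoiser, extracting activations from an intermediate block at a fixed noise level, and training only a linear classifier on top. One plausible hook-based implementation follows; the tapped block name and probe timestep are assumed hyperparameters (the DDAE paper tunes both).

```python
import torch

def linear_probe_features(unet, x0, alphas_cumprod, t_probe, layer="mid_block"):
    """DDAE-style feature extraction sketch: noise the input to a fixed
    timestep, run the denoiser once, and read pooled activations from an
    intermediate block via a forward hook. Assumes the denoiser exposes the
    named block as an attribute (as diffusers-style UNets do)."""
    feats = {}
    hook = getattr(unet, layer).register_forward_hook(
        lambda mod, inp, out: feats.update(z=out))
    t = torch.full((x0.size(0),), t_probe, device=x0.device, dtype=torch.long)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)  # q(x_t | x_0)
    with torch.no_grad():
        unet(x_t, t)                         # forward pass just to fire the hook
    hook.remove()
    return feats["z"].flatten(2).mean(-1)    # global-average-pool to (B, C)
```

A `torch.nn.Linear` head trained on these pooled features is then what yields the reported linear-evaluation accuracies.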
arXiv Detail & Related papers (2023-03-17T04:20:47Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
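The DIA idea is parameter sharing: one attention module serves every layer, with an LSTM carrying calibration state across layers. A toy sketch of that pattern follows; the channel-gating form and all shapes are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SharedAttention(nn.Module):
    """DIA-style sketch: a single attention module reused by every backbone
    layer, with an LSTM cell whose hidden state links the calibrations
    computed at successive layers."""
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels
        self.lstm = nn.LSTMCell(channels, hidden)
        self.to_gate = nn.Linear(hidden, channels)
        self.state = None                 # (h, c), carried from layer to layer

    def forward(self, x):                 # x: (B, C, H, W); same module at each layer
        pooled = x.mean(dim=(2, 3))                    # squeeze to (B, C)
        self.state = self.lstm(pooled, self.state)     # accumulate cross-layer context
        gate = torch.sigmoid(self.to_gate(self.state[0]))
        return x * gate[:, :, None, None]              # recalibrate feature maps

    def reset(self):                      # call once per forward pass of the backbone
        self.state = None
```

Because the module is shared, the extra parameter cost stays constant as the backbone gets deeper, which is what makes the mechanism generic across architectures.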
arXiv Detail & Related papers (2022-10-27T13:24:08Z)