Fugu-MT Paper Translation (Abstract): Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model

Paper overview: Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model

arxiv url: http://arxiv.org/abs/2511.14716v1
Date: Tue, 18 Nov 2025 17:58:16 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:53.249696
Title: Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
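Title (reference translation): Diffusion as Self-Distillation: End-to-End Latent Diffusion in One Model
Authors: Xiyuan Wang, Muhan Zhang
Abstract summary: Latent diffusion models rely on a complex three-part architecture consisting of a separate encoder, decoder, and diffusion network. This work proposes Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion.
Reference score (independently computed attention score): 53.77953728335891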
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.
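Abstract (reference translation): Standard latent diffusion models rely on a complex three-part architecture consisting of a separate encoder, decoder, and diffusion network that are trained in multiple stages. This modular design is computationally inefficient, yields suboptimal performance, and prevents diffusion from being unified with the single-network architectures common in vision foundation models. Our goal is to integrate these three components into a single network that can be trained end to end. We first show that a naive joint training approach fails catastrophically because of "latent collapse", in which the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a new analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD reaches FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs.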
