Fugu-MT Paper Translation (Abstract): Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation

Paper overview: Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation

arxiv url: http://arxiv.org/abs/2506.18999v1
Date: Mon, 23 Jun 2025 18:01:19 GMT
Status: Translation complete
Last updated in system: 2025-06-25 19:48:23.329741
Title: Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation
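Title (reference translation): Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation
Authors: Yuan Yao, Yicong Hong, Difan Liu, Long Mai, Feng Liu, Jiebo Luo
Abstract summary: This paper introduces diffusion transformer-to-Mamba distillation (T2MD), which forms an efficient training pipeline. We establish a hybrid model of diffusion self-attention and Mamba that achieves efficiency and global dependencies at the same time. Experiments show that the training path yields high-quality text-to-image generation at low overhead.
Reference score (independently computed attention level): 65.46359545280546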
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The quadratic computational complexity of self-attention in diffusion transformers (DiT) introduces substantial computational costs in high-resolution image generation. While the linear-complexity Mamba model emerges as a potential alternative, direct Mamba training remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-mamba distillation (T2MD), forming an efficient training pipeline that facilitates the transition from the self-attention-based transformer to the linear complexity state-space model Mamba. We establish a diffusion self-attention and Mamba hybrid model that simultaneously achieves efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the training difficulty and high cost of a state space model from scratch. Starting from the distilled 512$\times$512 resolution base model, we push the generation towards 2048$\times$2048 images via lightweight adaptation and high-resolution fine-tuning. Experiments demonstrate that our training path leads to low overhead but high-quality text-to-image generation. Importantly, our results also justify the feasibility of using sequential and causal Mamba models for generating non-causal visual output, suggesting the potential for future exploration.
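Abstract (reference translation): The quadratic computational complexity of self-attention in diffusion transformers (DiT) incurs substantial computational cost in high-resolution image generation. The linear-complexity Mamba model has emerged as a potential alternative, but training Mamba directly is empirically difficult. This paper introduces diffusion transformer-to-Mamba distillation (T2MD), building an efficient training pipeline that eases the transition from a self-attention-based transformer to the linear-complexity state-space model Mamba. We establish a hybrid model of diffusion self-attention and Mamba that achieves efficiency and global dependencies at the same time. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD reduces the difficulty and high cost of training a state-space model from scratch. Starting from the distilled 512$\times$512 resolution base model, generation is pushed to 2048$\times$2048 images through lightweight adaptation and high-resolution fine-tuning. Experiments show that this training path yields high-quality text-to-image generation at low overhead. The results also support the feasibility of using sequential, causal Mamba models to generate non-causal visual output, suggesting potential for future exploration.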
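The cost argument in the abstract can be illustrated with a little arithmetic. The following is a minimal sketch, not the paper's implementation: the `patch` value is a hypothetical token footprint chosen only for illustration (the abstract does not state one), and the function names are made up here. The ratios it computes do not depend on that choice: moving from 512x512 to 2048x2048 multiplies the token count by 16, so a cost quadratic in sequence length grows by 256 while a cost linear in sequence length grows by 16.

```python
def token_count(height, width, patch=16):
    """Number of tokens for an image, assuming one token per patch x patch pixels.

    `patch` is a hypothetical value for illustration; it is not from the paper.
    """
    return (height // patch) * (width // patch)


def quadratic_cost(n_tokens):
    # Self-attention compares every token with every other token.
    return n_tokens * n_tokens


def linear_cost(n_tokens):
    # A linear-complexity sequence model touches each token a constant number of times.
    return n_tokens


base = token_count(512, 512)
high = token_count(2048, 2048)

token_ratio = high // base
quadratic_ratio = quadratic_cost(high) // quadratic_cost(base)
linear_ratio = linear_cost(high) // linear_cost(base)

print(f"tokens: {base} -> {high} (x{token_ratio})")
print(f"quadratic cost grows x{quadratic_ratio}, linear cost grows x{linear_ratio}")
```

These are scaling ratios only; actual runtime depends on constants, hardware, and the hybrid attention/Mamba layout, none of which the abstract specifies.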
