Fugu-MT Paper Translation (Abstract): Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Paper abstract: Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

arxiv url: http://arxiv.org/abs/2510.11538v1
Date: Mon, 13 Oct 2025 15:39:13 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.43613
Title: Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
Title (reference translation): Massive Activations as the Key to Local Detail Synthesis in Diffusion Transformers
Authors: Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin
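Abstract summary: Diffusion Transformers (DiTs) have emerged as a powerful backbone for visual generation. Recent observations show that Massive Activations (MAs) appear in their internal feature maps. The authors propose Detail Guidance (DG) to enhance local detail fidelity.
Reference score (independently computed attention score): 33.765941209545986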
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We found that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of output. Building on these insights, we propose Detail Guidance (DG), a MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
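Abstract (reference translation): Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations show Massive Activations (MAs) in their internal feature maps, but their function is not well understood. This work systematically examines these activations to clarify their role in visual generation. The massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, they play a key role in local detail synthesis while having minimal impact on the overall semantic content of the output. Based on these findings, the authors propose Detail Guidance (DG), an MA-driven, training-free self-guidance strategy that explicitly improves the local detail fidelity of DiTs. Specifically, DG builds a degraded "detail-deficient" model by disrupting MAs and uses it to guide the original network toward higher-quality detail synthesis. DG integrates seamlessly with Classifier-Free Guidance (CFG), enabling further refinement of fine-grained details. Experiments show that DG consistently improves fine-grained detail quality across pre-trained DiTs (e.g., SD3, SD3.5, and Flux).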
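The abstract describes DG only in words: disrupt the massive activations to get a degraded model, then use that model to guide the original one. It does not state the disruption rule or the guidance formula. The sketch below is therefore an illustration under stated assumptions, not the authors' method. It assumes that "disrupting" means zeroing the channels with the largest mean absolute activation, and that guidance takes the extrapolation form common to self-guidance methods. All function and parameter names are hypothetical.

```python
import numpy as np


def disrupt_massive_activations(features, num_channels=2):
    """Zero the channels with the largest mean |activation|.

    ASSUMPTION: the abstract only says MAs are "disrupted"; zeroing the
    top-magnitude channels is one plausible stand-in, not the paper's rule.
    features has shape (tokens, channels).
    """
    magnitude = np.abs(features).mean(axis=0)       # per-channel magnitude
    top = np.argsort(magnitude)[-num_channels:]     # indices of largest channels
    degraded = features.copy()
    degraded[:, top] = 0.0
    return degraded


def detail_guidance(pred_original, pred_degraded, dg_scale):
    """Extrapolate away from the degraded ("detail-deficient") prediction.

    ASSUMPTION: this linear extrapolation is borrowed from other
    self-guidance schemes; the abstract gives no formula.
    """
    return pred_original + dg_scale * (pred_original - pred_degraded)


def cfg_with_detail_guidance(pred_cond, pred_uncond, pred_degraded,
                             cfg_scale, dg_scale):
    """Combine standard CFG with the assumed DG term by simple addition.

    ASSUMPTION: the abstract says DG integrates with CFG but not how.
    """
    cfg = pred_uncond + cfg_scale * (pred_cond - pred_uncond)
    return cfg + dg_scale * (pred_cond - pred_degraded)


if __name__ == "__main__":
    # Toy feature map: 4 tokens, 8 channels, two channels with huge values.
    features = np.ones((4, 8))
    features[:, 3] = 100.0
    features[:, 5] = -80.0
    print(disrupt_massive_activations(features, num_channels=2)[0])
```

With `dg_scale=0` both guidance functions reduce to the unguided (or plain CFG) prediction, which makes the assumed DG term easy to switch off for comparison.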


Related papers