Fugu-MT Paper Translation (Summary): Dynamic Chunking Diffusion Transformer

Paper summary: Dynamic Chunking Diffusion Transformer

arxiv url: http://arxiv.org/abs/2603.06351v1
Date: Fri, 06 Mar 2026 14:59:11 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:46.00003
Title: Dynamic Chunking Diffusion Transformer
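Title (reference translation): Dynamic Chunking Diffusion Transformer
Authors: Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum
Abstract summary: Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. This work introduces the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold. DC-DiT learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens.
Reference score (attention score computed independently by the site): 16.954365273223473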
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across $4{\times}$ and $16{\times}$ compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
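Abstract (reference translation): Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. Although effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that denoising progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence using a chunking mechanism learned with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines. DC-DiT can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.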
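
The abstract describes an encoder-router-decoder scaffold that shortens the token sequence in a data-dependent way. The toy sketch below illustrates only that general idea and is not the paper's implementation. It makes several assumptions that do not come from the abstract: the input is a flattened 1D token list rather than a 2D grid, the router is a fixed cosine-dissimilarity threshold rather than a learned module, and the names `route`, `compress`, `decompress`, and `threshold` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def route(tokens, threshold=0.1):
    """Hypothetical router: open a new chunk where adjacent tokens differ.

    Returns one chunk id per token. Position 0 always opens chunk 0.
    A real router would be learned; the threshold rule is an assumption.
    """
    ids = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        dissimilarity = 1.0 - cosine(prev, cur)
        ids.append(ids[-1] + (1 if dissimilarity > threshold else 0))
    return ids

def compress(tokens, ids):
    """Mean-pool the tokens of each chunk into one vector (the short sequence)."""
    n_chunks = ids[-1] + 1
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(n_chunks)]
    counts = [0] * n_chunks
    for tok, c in zip(tokens, ids):
        counts[c] += 1
        for j, v in enumerate(tok):
            sums[c][j] += v
    return [[s / counts[c] for s in sums[c]] for c in range(n_chunks)]

def decompress(chunks, ids):
    """Broadcast each chunk vector back to every position it covers."""
    return [list(chunks[c]) for c in ids]

# Toy flattened "image": a flat background run followed by a detailed run.
background = [[1.0, 0.0]] * 6
detail = [[0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
tokens = background + detail

ids = route(tokens)                # chunk id for each of the 10 positions
short = compress(tokens, ids)      # shorter sequence a backbone would process
restored = decompress(short, ids)  # back to the original length
print(len(tokens), "->", len(short), "tokens; chunk ids:", ids)
```

In this toy input the six identical background tokens collapse into a single chunk while each of the four detail tokens keeps its own chunk, which mirrors the behaviour the abstract reports: fewer tokens for uniform regions and more for detail-rich ones.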
