Fugu-MT paper translation (summary): One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Paper summary: One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

arxiv url: http://arxiv.org/abs/2603.12245v1
Date: Thu, 12 Mar 2026 17:57:04 GMT
Status: translation complete
Last updated in system: 2026-03-13 14:46:26.280181
Title: One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Authors: Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Dogyun Park, Anil Kag, Michael Vasilkovsky, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
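Abstract summary: Elastic Latent Interface Transformer (ELIT) is a drop-in, DiT-compatible mechanism that decouples input image size from compute. Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. On ImageNet-1K 512px, ELIT delivers an average gain of 35.3% and 39.6% in FID and FDD scores.
Reference score (attention level computed by this site): 80.19461768457622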
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of $35.3\%$ and $39.6\%$ in FID and FDD scores. Project page: https://snap-research.github.io/elit/
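The Read / latent-processing / Write flow described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`cross_attention`, `elastic_forward`, `sample_latent_budget`), the single-head attention without learned projections, the residual connections, and the uniform sampling of the latent budget are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head cross-attention with no learned projections (a simplification)."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)      # (num_queries, num_context)
    return softmax(scores, axis=-1) @ context      # (num_queries, d)


def elastic_forward(spatial_tokens, latents, num_latents, core_blocks):
    """Hypothetical forward pass through a latent interface.

    spatial_tokens: (N, d) image tokens
    latents:        (K, d) learnable latent sequence
    num_latents:    how many leading latents to keep (the compute budget)
    core_blocks:    callables standing in for standard transformer blocks
    """
    # Drop the tail: only the first `num_latents` latents are used.
    z = latents[:num_latents]
    # Read: latents query the spatial tokens (residual connection is an assumption).
    z = z + cross_attention(z, spatial_tokens)
    # The heavy computation runs on the latents, so its cost scales with num_latents.
    for block in core_blocks:
        z = block(z)
    # Write: spatial tokens query the processed latents.
    return spatial_tokens + cross_attention(spatial_tokens, z)


def sample_latent_budget(rng, max_latents, min_latents=1):
    """Pick a random number of latents to keep during training (uniform is an assumption)."""
    return int(rng.integers(min_latents, max_latents + 1))


rng = np.random.default_rng(0)
spatial = rng.standard_normal((64, 16))   # 64 spatial tokens, width 16
latents = rng.standard_normal((32, 16))   # up to 32 latents
identity_blocks = [lambda z: z]           # placeholder for real transformer blocks

k = sample_latent_budget(rng, max_latents=latents.shape[0])
out = elastic_forward(spatial, latents, k, identity_blocks)
```

In this sketch the output keeps the spatial shape regardless of `num_latents`, which is what allows the budget to be changed at inference time without touching the rest of the model.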

Related papers