Fugu-MT Paper Translation (Abstract): ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

Paper overview: ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

arxiv url: http://arxiv.org/abs/2605.15684v1
Date: Fri, 15 May 2026 07:13:27 GMT
Status: Translation complete
System update date: 2026-05-19 03:45:13.162563
Title: ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
Authors: Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen
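Abstract summary: The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. This paper introduces ElasticDiT, which achieves a dynamic fidelity-latency trade-off by adjusting spatial compression ratios and DiT block depths.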
Reference score (independently computed interest level): 19.789749822094617
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.


Related papers