Fugu-MT Paper Translation (Abstract): GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

Paper abstract: GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers

arxiv url: http://arxiv.org/abs/2512.03451v1
Date: Wed, 03 Dec 2025 05:08:18 GMT
Status: Translation complete
Last updated in system: 2025-12-04 20:02:55.131311
Title: GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers
Title (reference translation): GalaxyDiT: Efficient Video Generation Using Guidance Alignment and Adaptive Proxy in Diffusion Transformers
Authors: Zhiye Song, Steve Dai, Ben Keller, Brucek Khailany
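Abstract summary: GalaxyDiT is a training-free method that accelerates video generation with guidance alignment and systematic proxy selection for reuse metrics. It achieves $1.87\times$ and $2.37\times$ speedups on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, the approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
Reference score (independently computed attention score): 5.2424169748898555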
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
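Abstract (reference translation): Diffusion models have revolutionized video generation and have become essential tools for creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are the two cornerstones of this success. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method that accelerates video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, the technique identifies the optimal proxy for each video model across model families and parameter scales, ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, the approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).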
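
The abstract names two measurable quantities: a rank-order correlation used to pick a proxy for the reuse metric, and PSNR used to report fidelity to the base model. The sketch below illustrates both in plain Python. It is not the authors' implementation: the abstract does not say which rank correlation coefficient is used or what the candidate proxies are, so Spearman's coefficient, the function names (`psnr`, `ranks`, `spearman`, `pick_proxy`), and the toy inputs are all assumptions made for illustration.

```python
import math


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length value sequences."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def pick_proxy(candidates, true_error):
    """Hypothetical selection rule: the candidate whose per-step values
    rank-correlate best with the true reuse error."""
    return max(candidates, key=lambda name: spearman(candidates[name], true_error))
```

For example, `pick_proxy({"a": [1, 2, 3, 4], "b": [4, 1, 3, 2]}, [0.1, 0.2, 0.3, 0.4])` selects `"a"`, because its values order the steps the same way the reference error does. A rank-based measure is a natural fit here because a reuse decision depends only on how steps are ordered relative to a threshold, not on the proxy's absolute scale.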

Related papers