Fugu-MT Paper Translation (Abstract): Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

Paper overview: Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

arxiv url: http://arxiv.org/abs/2603.24260v1
Date: Wed, 25 Mar 2026 12:53:31 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.296372
Title: Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
Title (reference translation): Accelerating diffusion-based video editing via heterogeneous caching: beyond full computing at sampled denoising timesteps
Authors: Tianyi Liu, Ye Lu, Linfeng Zhang, Chen Cai, Jianjun Gao, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
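Abstract summary: HetCache is a training-free diffusion acceleration framework for masked video-to-video (MV2V) generation and editing. It reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves noticeable acceleration, including a 2.67$\times$ latency speedup and a reduction in FLOPs.
Reference score (independently computed attention score): 37.62908191585867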
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion-based video editing has emerged as an important paradigm for high-quality and flexible content generation. However, despite their generality and strong modeling capacity, Diffusion Transformers (DiT) remain computationally expensive due to the iterative denoising process, posing challenges for practical deployment. Existing video diffusion acceleration methods primarily exploit denoising timestep-level feature reuse, which mitigates the redundancy in the denoising process but overlooks the architectural redundancy within the DiT: many attention operations over spatio-temporal tokens are executed redundantly, offering little to no incremental contribution to the model output. This work introduces HetCache, a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity in diffusion-based masked video-to-video (MV2V) generation and editing. Instead of uniformly reusing or randomly sampling tokens, HetCache assesses the contextual relevance and interaction strength among various types of tokens in designated computing steps. Guided by spatial priors, it divides the spatio-temporal tokens in the DiT model into context and generative tokens, and selectively caches the context tokens that exhibit the strongest correlation and most representative semantics with generative ones. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves a noticeable acceleration, including a 2.67$\times$ latency speedup and FLOPs reduction over commonly used foundation models, with negligible degradation in editing quality.
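Abstract (reference translation): Diffusion-based video editing has emerged as an important paradigm for high-quality, flexible content generation. Despite their generality and strong modeling capacity, however, Diffusion Transformers (DiT) are computationally expensive because of the iterative denoising process, which makes practical deployment difficult. Existing video diffusion acceleration methods mainly rely on timestep-level feature reuse, which mitigates redundancy in the denoising process, but they overlook architectural redundancy within the DiT: many attention operations over spatio-temporal tokens are executed redundantly and contribute almost nothing to the model output. HetCache is a training-free diffusion acceleration framework designed to exploit the inherent heterogeneity of diffusion-based masked video-to-video (MV2V) generation and editing. Rather than reusing tokens uniformly or sampling them at random, HetCache evaluates the contextual relevance and interaction strength among different types of tokens at designated computing steps. Guided by spatial priors, it divides the spatio-temporal tokens of the DiT model into context tokens and generative tokens, and selectively caches the context tokens that show the strongest correlation with, and the most representative semantics for, the generative tokens. This strategy reduces redundant attention operations while maintaining editing consistency and fidelity. Experiments show that HetCache achieves notable acceleration, including a 2.67$\times$ latency speedup and a FLOPs reduction, with negligible degradation in editing quality.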
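
The token partitioning and selection idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `split_tokens` and `select_context_tokens`, the use of a binary edit mask as the spatial prior, and the relevance score (mean softmax attention that each context token receives from generative queries) are all assumptions made for illustration. The abstract does not specify how HetCache scores tokens or how the cached features are reused.

```python
import numpy as np


def split_tokens(edit_mask):
    """Split token indices using a binary spatial prior (assumed: an edit mask).

    edit_mask: 1-D bool array, True where a token lies in the edited region.
    Returns (context_idx, generative_idx).
    """
    edit_mask = np.asarray(edit_mask, dtype=bool)
    generative_idx = np.flatnonzero(edit_mask)
    context_idx = np.flatnonzero(~edit_mask)
    return context_idx, generative_idx


def select_context_tokens(q, k, context_idx, generative_idx, keep):
    """Pick the context tokens that generative queries attend to most.

    q, k: (num_tokens, dim) query and key matrices for one attention head.
    keep: number of context tokens to select.
    Returns the selected context token indices in ascending order.
    """
    if context_idx.size == 0 or generative_idx.size == 0 or keep <= 0:
        return np.empty(0, dtype=context_idx.dtype)
    dim = q.shape[1]
    # Scaled dot-product logits: generative queries against context keys.
    logits = q[generative_idx] @ k[context_idx].T / np.sqrt(dim)
    # Numerically stable softmax over the context axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Hypothetical relevance score: mean attention each context token receives.
    relevance = weights.mean(axis=0)
    keep = min(keep, context_idx.size)
    top = np.argsort(-relevance, kind="stable")[:keep]
    return np.sort(context_idx[top])


# Usage: six tokens, the last two lie inside the edited region.
mask = np.array([False, False, False, False, True, True])
context_idx, generative_idx = split_tokens(mask)
q = np.zeros((6, 2))
k = np.zeros((6, 2))
q[4] = q[5] = [1.0, 0.0]
k[0], k[1], k[2], k[3] = [5.0, 0.0], [0.0, 5.0], [4.0, 0.0], [-5.0, 0.0]
selected = select_context_tokens(q, k, context_idx, generative_idx, keep=2)
print(selected.tolist())
```

In this toy setup the keys of tokens 0 and 2 align best with the generative queries, so those two context tokens are selected. A real DiT would aggregate such scores across heads and layers, and that aggregation is outside what the abstract describes.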


Related papers