Fugu-MT paper translation (abstract): Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

Paper abstract: Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

arxiv url: http://arxiv.org/abs/2606.08780v1
Date: Sun, 07 Jun 2026 18:53:57 GMT
Status: translation complete
System update date: 2026-06-09 14:42:06.44416
Title: Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing
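Title (reference translation): Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing
Authors: Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu
Abstract summary: Existing zero-shot video editing methods fail to preserve a video's original temporal structure. This paper proposes a new zero-shot editing method that focuses on preserving that temporal structure. The proposed method achieves state-of-the-art results by balancing preservation of the original temporal structure against computational efficiency.
Reference score (independently computed attention score): 48.768011768488584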
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing zero-shot video editing methods rely on pre-trained diffusion models and achieve spatial control and basic temporal consistency, but fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy which leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.
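Abstract (reference translation): Existing zero-shot video editing methods rely on pre-trained diffusion models and achieve spatial control and basic temporal consistency, but fundamentally fail to preserve the video's original temporal structure. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a new zero-shot editing approach that focuses explicitly on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To improve both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy that leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we adopt an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments show that the method achieves state-of-the-art results, balances preservation of the original temporal structure with computational efficiency, and sets a new benchmark for zero-shot video editing fidelity.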
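
The partitioning and anchor-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which frame features, similarity measure, threshold, or anchor criterion the authors use, so everything below is an assumption made for illustration: per-frame feature vectors are taken as given, similarity is cosine similarity between consecutive frames, a new clip starts when that similarity drops below a hypothetical `threshold`, and the anchor is the frame with the highest mean similarity to the other frames in its clip. The function names `partition_clips` and `select_anchor` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def partition_clips(features: List[Vector], threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Split frames into contiguous clips, returned as half-open (start, end) ranges.

    Assumption: a clip boundary is placed wherever the similarity between
    consecutive frames falls below `threshold`.
    """
    if not features:
        return []
    clips: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            clips.append((start, i))
            start = i
    clips.append((start, len(features)))
    return clips


def select_anchor(features: List[Vector], clip: Tuple[int, int]) -> int:
    """Return the index of the frame most similar, on average, to the rest of its clip.

    Assumption: "representative" is read here as the medoid under cosine similarity.
    """
    start, end = clip

    def mean_similarity(i: int) -> float:
        others = [cosine(features[i], features[j]) for j in range(start, end) if j != i]
        return sum(others) / len(others) if others else 1.0

    return max(range(start, end), key=mean_similarity)


# Toy example: three frames pointing roughly along one direction, then two along another.
features = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [0.0, 1.0], [0.1, 1.0]]
clips = partition_clips(features, threshold=0.8)
anchors = [select_anchor(features, clip) for clip in clips]
```

In this toy input the similarity between the third and fourth frames is well below the threshold, so the frames split into two clips, and the middle frame of the first clip is chosen as its anchor because it lies between its neighbours. The token merging and alternating combination strategies mentioned in the abstract are not sketched here, since the abstract gives no detail on how they operate.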
