Fugu-MT Paper Translation (Abstract): PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Paper abstract: PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

arxiv url: http://arxiv.org/abs/2603.25730v1
Date: Thu, 26 Mar 2026 17:59:05 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.425828
Title: PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
Title (reference translation): PackForcing: Short Video Training for Long Video Sampling and Long Context Inference
Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang
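Abstract summary: PackForcing generates coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables 24x temporal extrapolation (5 s to 120 s), operating effectively either zero-shot or when trained on only 5-second clips.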
Reference score (independently computed attention score): 46.18482046594169
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. https://github.com/ShandaAI/PackForcing
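Abstract (reference translation): Autoregressive video diffusion models have made notable progress, but long-video generation remains bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors. To address these challenges, we propose PackForcing, a unified framework that efficiently manages the generation history through a new three-partition KV-cache strategy. Specifically, the historical context is divided into (1) sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) mid tokens, which achieve large spatiotemporal compression (a 32x token reduction) via a dual-branch network that fuses progressive 3D convolutions with low-resolution VAE re-encoding; and (3) recent tokens, which are kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, together with a continuous Temporal RoPE Adjustment that seamlessly re-aligns the position gaps caused by dropped tokens with negligible overhead. With this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables 24x temporal extrapolation (5 s to 120 s), operating effectively either zero-shot or when trained on only 5-second clips. Extensive results on VBench show state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), indicating that short-video supervision is sufficient for high-quality long-video synthesis. https://github.com/ShandaAI/PackForcing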
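
The three-partition cache described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Block` type, the `score` field, the `select_context` function, and the choice to re-index kept blocks to contiguous positions are all assumptions made for illustration. The abstract does not specify how mid tokens are scored or how the Temporal RoPE Adjustment is computed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    frame_index: int  # original temporal index of this cached block
    score: float      # hypothetical relevance score (scoring rule is an assumption)


def select_context(blocks, n_sink, n_recent, k):
    """Keep sink blocks, the top-k mid blocks, and recent blocks.

    Returns (kept_blocks, positions), where positions are contiguous
    indices assigned after dropping blocks. Contiguous re-indexing is
    an assumed stand-in for the paper's Temporal RoPE Adjustment.
    """
    if len(blocks) <= n_sink + n_recent:
        kept = list(blocks)  # history is short: nothing to drop
    else:
        split = len(blocks) - n_recent
        sink = blocks[:n_sink]        # early anchor blocks, always kept
        mid = blocks[n_sink:split]    # compressed history, pruned to k
        recent = blocks[split:]       # latest blocks, always kept
        top_mid = sorted(mid, key=lambda b: b.score, reverse=True)[:k]
        top_mid.sort(key=lambda b: b.frame_index)  # restore temporal order
        kept = sink + top_mid + recent
    positions = list(range(len(kept)))  # close gaps left by dropped blocks
    return kept, positions


# Usage: 10 cached blocks, keep 2 sink + 2 mid + 3 recent.
scores = [0.0, 0.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.0, 0.0]
history = [Block(i, s) for i, s in enumerate(scores)]
kept, positions = select_context(history, n_sink=2, n_recent=3, k=2)
print([b.frame_index for b in kept], positions)
```

In this sketch the number of kept blocks never exceeds `n_sink + k + n_recent`, which is the sense in which the cache is bounded regardless of video length. For scale, a 120-second video at 16 FPS is 1,920 frames, 24 times the 80 frames of a 5-second training clip.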

Related papers