Fugu-MT Paper Translation (Abstract): ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Paper abstract: ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

arxiv url: http://arxiv.org/abs/2603.25746v1
Date: Thu, 26 Mar 2026 17:59:59 GMT
Status: Translation complete
System update date: 2026-03-27 20:52:48.437591
Title: ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Title (reference translation): ShotStream: Multi-Shot Video Generation for Interactive Storytelling
Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue
Abstract summary: ShotStream is a novel causal multi-shot architecture that enables interactive storytelling. It generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU.
Reference score (attention score computed by this site): 31.758254551463406
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. And a RoPE discontinuity indicator is employed to explicitly distinguish the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our
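Abstract (reference translation): Multi-shot video generation is essential for long-form storytelling, but current bidirectional architectures offer limited interactivity and suffer from high latency. We propose ShotStream, a causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By recasting the task as next-shot generation conditioned on historical context, ShotStream lets users dynamically direct an ongoing narrative through streaming prompts. A text-to-video model is first fine-tuned into a bidirectional next-shot generator and then distilled into a causal student using Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds the frames generated within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. It begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. The training and inference code and the models are made available by the authors.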
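The dual-cache idea in the abstract can be illustrated with a toy bookkeeping sketch. This is not the authors' implementation: the class name `DualCache`, the cache sizes, the rule of promoting the last frames of a finished shot to the global cache, and the fixed `POSITION_GAP` standing in for the RoPE discontinuity indicator are all assumptions made for illustration. Frames are plain strings here rather than latent tensors.

```python
from collections import deque

# Hypothetical position offset inserted between the two caches so that a
# position-aware model could tell them apart (a stand-in for the paper's
# RoPE discontinuity indicator, whose exact form is not given in the abstract).
POSITION_GAP = 1000


class DualCache:
    """Toy global (inter-shot) and local (intra-shot) frame caches."""

    def __init__(self, max_global=4, max_local=8):
        # Bounded queues: the oldest entries are dropped once a cache is full.
        self.global_frames = deque(maxlen=max_global)
        self.local_frames = deque(maxlen=max_local)

    def add_generated_frame(self, frame):
        """Record a frame generated within the current shot."""
        self.local_frames.append(frame)

    def end_shot(self, num_conditional=2):
        """Assumed policy: keep the last frames of the shot as conditional frames."""
        if num_conditional > 0:
            self.global_frames.extend(list(self.local_frames)[-num_conditional:])
        self.local_frames.clear()

    def context(self):
        """Return (position, source, frame) triples for the next generation step."""
        entries = [(i, "global", f) for i, f in enumerate(self.global_frames)]
        base = len(self.global_frames) + POSITION_GAP
        entries += [(base + j, "local", f) for j, f in enumerate(self.local_frames)]
        return entries


# Usage: generate three frames of shot 0, close the shot, start shot 1.
cache = DualCache(max_global=4, max_local=8)
for t in range(3):
    cache.add_generated_frame(f"shot0_frame{t}")
cache.end_shot(num_conditional=2)
cache.add_generated_frame("shot1_frame0")
ctx = cache.context()
```

In this sketch the global entries occupy positions 0 and 1, while the single local entry sits at position 1002, so the jump in position index marks the boundary between historical context and the current shot.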

List of related papers