Fugu-MT Paper Translation (Abstract): Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

Paper abstract: Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

arxiv url: http://arxiv.org/abs/2605.18733v1
Date: Mon, 18 May 2026 17:54:34 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:50.219958
Title: Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
Title (reference translation): Advancing Narrative Long Video Generation with Training-Free Identity-Aware Memory
Authors: Jinzhuo Liu, Jiangning Zhang, Wencan Jiang, Yabiao Wang, Dingkang Liang, Zhucun Xue, Ran Yi, Yong Liu
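Abstract summary: IAMFlow is a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities. A VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. NarraStream-Bench is a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions.
Reference score (attention level computed independently by this site): 79.01059178883817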
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39$\times$ speedup over the most efficient baseline in the 60-second multi-prompt setting.
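Abstract (reference translation): Autoregressive video generation has improved rapidly in visual fidelity and interactivity, yet it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames with predefined strategies or retrieve keyframes based on coarse implicit attention signals; neither handles evolving prompts whose entity references shift, which leads to identity drift, character duplication, and attribute loss. To address this, the authors propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities and enables consistent generation across prompt transitions. Specifically, an LLM extracts entities with their visual attributes from each prompt and assigns unique global IDs for the identity-aware memory, while a VLM asynchronously verifies and refines the attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the framework computationally practical, they design a systematic inference acceleration pipeline comprising asynchronous visual verification, adaptive prompt transition, and model quantization, which generates faster than existing baselines. They also introduce NarraStream-Bench, a benchmark for narrative streaming video generation with 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates traditional metrics with multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39$\times$ speedup over the most efficient baseline in the 60-second multi-prompt setting.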
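
The abstract describes assigning unique global IDs to entities extracted from each prompt and later refining their attributes from rendered frames. The following is only a minimal sketch of that bookkeeping idea, not the authors' implementation: the class and method names (`IdentityMemory`, `register`, `refine`) are hypothetical, case-insensitive name matching stands in for the LLM-based extraction, and a plain dictionary update stands in for the asynchronous VLM verification. The merge policy (later prompts fill in missing attributes, frame observations overwrite) is likewise an assumption made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Entity:
    """One persistent entity with a global ID and its visual attributes."""
    entity_id: int
    name: str
    attributes: dict = field(default_factory=dict)


class IdentityMemory:
    """Toy registry that maps entity names to persistent global IDs.

    Hypothetical stand-in: a real system would resolve entity references
    with an LLM rather than by normalized string matching.
    """

    def __init__(self) -> None:
        self._next_id = count(1)   # global IDs start at 1
        self._by_name = {}         # normalized name -> Entity

    def register(self, name: str, attributes: dict) -> Entity:
        """Return the existing entity for `name`, or create one with a new ID."""
        key = name.strip().lower()
        entity = self._by_name.get(key)
        if entity is None:
            entity = Entity(next(self._next_id), key)
            self._by_name[key] = entity
        # Assumed policy: a later prompt only fills in attributes that are
        # still missing, so earlier attributes are not lost.
        for attr, value in attributes.items():
            entity.attributes.setdefault(attr, value)
        return entity

    def refine(self, name: str, observed: dict) -> Entity:
        """Overwrite attributes with values observed in rendered frames.

        Assumed stand-in for the VLM verification step in the abstract.
        """
        entity = self._by_name[name.strip().lower()]
        entity.attributes.update(observed)
        return entity


# Usage: the same character mentioned in two prompts keeps one global ID.
memory = IdentityMemory()
girl_first = memory.register("Girl", {"hair": "red", "coat": "yellow"})
dog = memory.register("dog", {"fur": "white"})
girl_again = memory.register("girl", {"mood": "smiling"})
refined = memory.refine("girl", {"coat": "orange"})
```

In this sketch, a character that reappears in a later prompt resolves to the same `Entity` object, so attributes stated earlier persist even when the later prompt omits them.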

Related papers