Fugu-MT Paper Translation (Summary): LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Paper summary: LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

arxiv url: http://arxiv.org/abs/2606.02553v1
Date: Mon, 01 Jun 2026 17:50:49 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.554453
Title: LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
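Title (reference translation): LongLive-RAG: A General-Purpose Retrieval Framework for Long Video Generation
Authors: Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen
Abstract summary: Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. The paper addresses this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. It proposes LongLive-RAG, a general retrieval framework for AR video generation.
Reference score (independently computed attention score): 28.243294694107288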
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
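The retrieval step described in the abstract (a query embedding looks up relevant historical latents at each new block) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LatentHistory` class, the use of cosine similarity, the `top_k` and `exclude_recent` parameters, and the toy two-dimensional embeddings are all assumptions made here for clarity. The abstract does not specify the similarity measure or how many latents are retrieved, and the Window Temporal Delta Loss used to train the embeddings is not modeled.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LatentHistory:
    """Hypothetical content-addressable store of (embedding, latent) pairs."""

    def __init__(self) -> None:
        self.embeddings: List[List[float]] = []
        self.latents: List[object] = []

    def append(self, embedding: Vector, latent: object) -> None:
        """Record one generated block: its embedding and its latent."""
        self.embeddings.append(list(embedding))
        self.latents.append(latent)

    def retrieve(
        self, query: Vector, top_k: int, exclude_recent: int = 0
    ) -> List[Tuple[int, float]]:
        """Return (block_index, score) pairs for the top_k most similar past blocks.

        The newest `exclude_recent` blocks are skipped, on the assumption that
        the sliding attention window already covers them.
        """
        searchable = max(len(self.embeddings) - exclude_recent, 0)
        scored = [
            (i, cosine_similarity(query, self.embeddings[i]))
            for i in range(searchable)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Toy usage: four past blocks with made-up 2-D embeddings.
history = LatentHistory()
history.append([1.0, 0.0], "latent_0")
history.append([0.0, 1.0], "latent_1")
history.append([0.9, 0.1], "latent_2")
history.append([0.1, 0.9], "latent_3")  # treated as still inside the recent window

hits = history.retrieve([1.0, 0.0], top_k=2, exclude_recent=1)
retrieved_latents = [history.latents[i] for i, _ in hits]
print(retrieved_latents)
```

In this sketch the generator would condition on `retrieved_latents` in addition to the recent window, which is the non-local context the abstract refers to. A brute-force scan is used only to keep the example short.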
