Fugu-MT Paper Translation (Abstract): Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

Paper overview: Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

arxiv url: http://arxiv.org/abs/2605.06185v1
Date: Thu, 07 May 2026 13:01:28 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.805073
Title: Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Authors: Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang, Erwei Yin
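Abstract summary: Event-Causal RAG is a lightweight retrieval-augmented framework for infinite long-video reasoning. It segments streaming video into semantically coherent events and represents each event as a structured State-Event-State graph. On top of the resulting memory, it uses a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains.
Reference score (independently computed attention metric): 9.729442664774988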
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
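
The pipeline described in the abstract (State-Event-State graphs merged into a dual-store memory, then bidirectional retrieval of a causal chain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class and function names (`SESGraph`, `EventMemory`, `retrieve`), the bag-of-words similarity standing in for a learned embedding, and the rule that links two events when one event's post-state equals the next event's pre-state are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class SESGraph:
    """One event with the states before and after it (State-Event-State)."""
    event_id: int
    pre_state: str
    event: str
    post_state: str
    span: tuple  # (start_sec, end_sec) of the supporting video segment


def _bow(text):
    # Toy stand-in for a learned text embedding: lowercase bag of words.
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EventMemory:
    """Hypothetical dual store: a semantic index plus a causal graph."""

    def __init__(self):
        self.events = {}                  # event_id -> SESGraph
        self.vectors = {}                 # semantic store: event_id -> bag of words
        self.causes = defaultdict(set)    # graph store: event_id -> predecessor ids
        self.effects = defaultdict(set)   # graph store: event_id -> successor ids
        self._by_post_state = defaultdict(set)

    def add(self, ses):
        # Assumed merge rule: an earlier event whose post-state equals this
        # event's pre-state is recorded as a causal predecessor.
        for prev_id in self._by_post_state.get(ses.pre_state, ()):
            self.causes[ses.event_id].add(prev_id)
            self.effects[prev_id].add(ses.event_id)
        self.events[ses.event_id] = ses
        self.vectors[ses.event_id] = _bow(
            f"{ses.pre_state} {ses.event} {ses.post_state}")
        self._by_post_state[ses.post_state].add(ses.event_id)

    def _walk(self, start, edges, hops):
        # Breadth-first expansion along one edge direction, up to `hops` steps.
        seen, frontier = set(), {start}
        for _ in range(hops):
            frontier = {n for e in frontier for n in edges.get(e, ())}
            frontier -= seen | {start}
            seen |= frontier
        return seen

    def retrieve(self, query, hops=2):
        """Return the causal chain around the best semantic match, in time order."""
        if not self.events:
            return []
        q = _bow(query)
        seed = max(self.vectors, key=lambda i: _cosine(q, self.vectors[i]))
        # Bidirectional step: expand toward causes and toward effects.
        chain = {seed}
        chain |= self._walk(seed, self.causes, hops)
        chain |= self._walk(seed, self.effects, hops)
        return sorted((self.events[i] for i in chain), key=lambda e: e.span[0])


memory = EventMemory()
memory.add(SESGraph(0, "kettle empty", "person fills kettle", "kettle full", (0, 8)))
memory.add(SESGraph(1, "kettle full", "person switches kettle on", "water boiling", (40, 45)))
memory.add(SESGraph(2, "door closed", "dog enters room", "door open", (300, 310)))
memory.add(SESGraph(3, "water boiling", "person pours tea", "cup full", (900, 915)))

chain = memory.retrieve("why was the cup full")
print([e.event for e in chain])
```

In this toy run the query matches the "person pours tea" event semantically, and the backward walk pulls in the two kettle events that led to it while skipping the unrelated "dog enters room" event. The `span` field plays the role of the video evidence that would be passed, along with the chain, to a backbone video model.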
