Fugu-MT Paper Translation (Summary): Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Paper summary: Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

arxiv url: http://arxiv.org/abs/2605.08271v1
Date: Fri, 08 May 2026 03:21:47 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.518582
Title: Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning
Title (reference translation): Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning
Authors: Jiazheng Li, Chi-Hao Wu, Yunze Liu, Kaize Ding, Jundong Li, Chuxu Zhang,
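Abstract summary: MAGIC-Video is a framework built around a multimodal memory graph with an interleaved narrative chain. On EgoLifeQA, Ego-R1, and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines.
Reference score (independently computed attention score): 82.97398529552166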
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1, and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC-video.
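Abstract (reference translation): Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a difficult challenge. With current multimodal LLMs, even a million-token context window gives a frame budget that covers only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval is fragmented across modalities and lacks long-range narrative summaries spanning days or weeks. The paper proposes MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1, and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, improving on the prior best agentic system by 10.1, 7.4, and 5.9 points on the respective benchmarks. Code is available at https://github.com/lijiazheng0917/MAGIC-video.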
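
The abstract's point that a million-token context window covers only tens of minutes of densely sampled video can be checked with rough arithmetic. The sketch below is illustrative only: the values of 256 tokens per frame and 1 frame per second are assumptions chosen for the example and do not come from the paper, and the function name `minutes_covered` is hypothetical.

```python
def minutes_covered(context_tokens: int, tokens_per_frame: int,
                    frames_per_second: float) -> float:
    """Minutes of video whose sampled frames fit in the context window."""
    # Whole frames that fit in the token budget.
    frames = context_tokens // tokens_per_frame
    # Convert frame count to seconds of video, then to minutes.
    return frames / frames_per_second / 60.0


# Assumed values (not from the paper): 256 tokens per frame, 1 frame per second.
budget = minutes_covered(1_000_000, 256, 1.0)
week = 7 * 24 * 60  # minutes in a one-week recording

print(f"{budget:.1f} minutes of video fit in the context window")
print(f"{budget / week:.2%} of a one-week recording")
```

Under these assumptions the window holds about an hour of footage, well under one percent of a week-long recording. Different tokenizers and sampling rates shift the numbers, but the gap remains several orders of magnitude.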


Related papers