Fugu-MT paper translation (abstract): MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Paper overview: MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

arxiv url: http://arxiv.org/abs/2606.07512v1
Date: Fri, 05 Jun 2026 17:59:21 GMT
Status: translation complete
Last updated in system: 2026-06-08 14:33:29.886883
Title: MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
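Title (reference translation): MemDreamer: Separating Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and an Agentic Retrieval Mechanism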
Authors: Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen
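Abstract summary: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. We introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points.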
Reference score (independently calculated attention level): 70.69809410471993
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
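Abstract (reference translation): Current Vision-Language Models struggle with hours-long videos, because processing full-length visual sequences causes prohibitive token explosion and attention dilution. We therefore introduce MemDreamer, which decouples perception from reasoning and turns long-video understanding into an agentic exploration process. As a plug-and-play framework, it streams the video incrementally to build a Hierarchical Graph Memory: a top-down, three-tier architecture for semantic abstraction, anchored by a foundational graph that captures spatiotemporal and causal relations. During inference, the reasoning model uses agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges through an Observation-Reason-Action loop. In experiments, MemDreamer achieves SOTA results on four mainstream benchmarks and narrows the gap with human experts to only 3.7 points. It limits the reasoning context window to 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. In addition, statistical analysis reveals a strong positive linear correlation between a VLM's performance on logical reasoning and on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.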
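
The abstract describes a tiered graph memory queried through an Observation-Reason-Action loop. The following is a minimal illustrative sketch of that idea, not the authors' implementation. Every name in it (`Node`, `GraphMemory`, `expand`, `search`, `traverse`, `retrieve`, `scripted_policy`), the meaning assigned to each tier, the `caused_by` relation, and the toy video content are assumptions made for illustration. A hard-coded policy stands in for the reasoning model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int   # assumed tiers: 0 = video summary, 1 = segment, 2 = foundational event
    text: str
    children: list = field(default_factory=list)  # ids of nodes one tier down
    edges: dict = field(default_factory=dict)     # relation name -> list of node ids


class GraphMemory:
    """Toy three-tier memory with three retrieval tools (all hypothetical)."""

    def __init__(self):
        self.nodes = {}

    def add(self, node, parent_id=None):
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.nodes[parent_id].children.append(node.node_id)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    def expand(self, node_id):
        """Navigate the hierarchy: list the children of a node."""
        return list(self.nodes[node_id].children)

    def search(self, keyword):
        """Search nodes: case-insensitive substring match on node text."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values() if kw in n.text.lower()]

    def traverse(self, node_id, relation):
        """Follow a logical edge (for example a causal relation)."""
        return list(self.nodes[node_id].edges.get(relation, []))


def retrieve(memory, policy, question, max_steps=8):
    """Observation-Reason-Action loop: the policy sees the latest observation,
    picks a tool call, and the tool result becomes the next observation."""
    observation, trace = ["root"], []
    for _ in range(max_steps):
        action, arg = policy(memory, question, observation, trace)  # reason
        if action == "answer":
            return arg, trace
        if action == "expand":
            observation = memory.expand(arg)
        elif action == "search":
            observation = memory.search(arg)
        elif action == "traverse":
            observation = memory.traverse(*arg)
        else:
            raise ValueError(f"unknown action: {action}")
        trace.append((action, arg, observation))
    return None, trace


def scripted_policy(memory, question, observation, trace):
    """Stand-in for a reasoning model: a fixed two-step plan, then an answer."""
    if len(trace) == 0:
        return "search", "spill"
    if len(trace) == 1:
        return "traverse", (observation[0], "caused_by")
    return "answer", memory.nodes[observation[0]].text


# Toy memory; the contents are invented for this example.
memory = GraphMemory()
memory.add(Node("root", 0, "Cooking video overview"))
memory.add(Node("seg1", 1, "Preparation in the kitchen"), parent_id="root")
memory.add(Node("ev1", 2, "The cat jumps onto the counter"), parent_id="seg1")
memory.add(Node("ev2", 2, "A glass of milk spills on the floor"), parent_id="seg1")
memory.link("ev2", "caused_by", "ev1")

answer, trace = retrieve(memory, scripted_policy, "Why did the milk spill?")
print(answer)
```

In this sketch the reasoning step only ever sees the node ids returned by the previous tool call, never the full memory. That is one simple way to keep the reasoning context small, in the spirit of the abstract's context-window claim.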

Related papers