Fugu-MT Paper Translation (Abstract): Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Paper overview: Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

arxiv url: http://arxiv.org/abs/2603.17307v1
Date: Wed, 18 Mar 2026 03:04:49 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.490113
Title: Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Title (reference translation): Symphony: A Cognitive Multi-Agent System for Long-Video Understanding
Authors: Haiyang Yan, Hongyun Zhou, Peng Xu, Xiaoxue Feng, Mengyi Liu,
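Abstract summary: Long-form video understanding (LVU) tasks are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. We propose Symphony, a multi-agent system that decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism.
Reference score (independently computed attention metric): 5.981841802050151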
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.
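Abstract (reference translation): Despite the rapid development and widespread application of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents shows that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Furthermore, directly reducing the temporal context through embedding-based retrieval may lose key information for complex problems. This paper proposes Symphony, a multi-agent system that alleviates these limitations. By emulating human cognitive patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving reasoning capability. In addition, Symphony uses a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments. Experiments show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.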
