Fugu-MT Paper Translation (Abstract): SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Paper overview: SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

arxiv url: http://arxiv.org/abs/2604.05079v1
Date: Mon, 06 Apr 2026 18:30:50 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.439737
Title: SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
Authors: Zhongyu Yang, Zuhao Yang, Shuo Zhan, Tan Yue, Wei Pang, Yingfang Yuan,
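Abstract summary: VideoQA is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. This paper proposes SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. Experiments show that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.
Reference score (independently computed attention score): 6.451186120567798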
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.
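The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not specify any interfaces, so every name here (`State`, `answer_with_storyline`, the `toy_*` stubs, `CAPTIONS`) is hypothetical, the agents are plain callables standing in for model calls, and the meta-agent is simplified to an agreement check between the two modalities.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    storyline: List[str] = field(default_factory=list)  # narrative built so far
    failures: List[str] = field(default_factory=list)   # log of rejected rounds


def answer_with_storyline(
    question: str,
    num_frames: int,
    suggest_frames: Callable[[str, State, int], List[int]],
    describe_frame: Callable[[int], str],
    visual_agent: Callable[[str, List[int], List[str]], str],
    textual_agent: Callable[[str, List[str]], str],
    meta_agent: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], State]:
    """Hypothetical loop: suggest frames, extend storyline, predict, check."""
    state = State()
    for round_idx in range(max_rounds):
        # Refinement suggestion agent: picks frames using the failure history.
        frames = suggest_frames(question, state, num_frames)
        # Storyline agent: progressively extends the narrative representation.
        state.storyline.extend(describe_frame(i) for i in frames)
        # Cross-modal decision agents: independent predictions per modality.
        visual = visual_agent(question, frames, state.storyline)
        textual = textual_agent(question, state.storyline)
        # Meta-agent (simplified assumption): accept only an aligned answer.
        answer = meta_agent(visual, textual)
        if answer is not None:
            return answer, state
        state.failures.append(
            f"round {round_idx}: visual={visual}, textual={textual}"
        )
    return None, state


# Toy stand-ins so the sketch runs without any model; captions play the video.
CAPTIONS = [
    "a man enters a kitchen",
    "he chops onions",
    "he starts crying",
    "he serves soup",
]


def toy_suggest(question: str, state: State, num_frames: int) -> List[int]:
    # Each recorded failure moves the two-frame window further into the video.
    start = 2 * len(state.failures)
    return list(range(start, min(start + 2, num_frames)))


def toy_visual(question: str, frames: List[int], storyline: List[str]) -> str:
    return "onions" if any("crying" in CAPTIONS[i] for i in frames) else "unknown"


def toy_textual(question: str, storyline: List[str]) -> str:
    text = " ".join(storyline)
    return "onions" if "onions" in text and "crying" in text else "unknown"


def toy_meta(visual: str, textual: str) -> Optional[str]:
    return visual if visual == textual and visual != "unknown" else None


def demo() -> Tuple[Optional[str], State]:
    return answer_with_storyline(
        "Why is the man crying?", len(CAPTIONS), toy_suggest,
        lambda i: CAPTIONS[i], toy_visual, toy_textual, toy_meta,
    )
```

In this toy run the first round sees only the first two captions, so neither stub commits to an answer and the round is logged as a failure; the second round adds the remaining captions and both stubs agree. A real system would presumably replace the stubs with vision-language and language model calls, and the meta-agent would likely do more than compare strings.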

Related papers