Fugu-MT Paper Translation (Summary): VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Paper overview: VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

arxiv url: http://arxiv.org/abs/2511.19524v1
Date: Mon, 24 Nov 2025 07:04:51 GMT
Status: Translation complete
Last updated in system: 2025-11-26 17:37:04.072001
Title: VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Title (reference translation): VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Authors: Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang
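Abstract summary: This paper proposes VideoChat-M1, a novel multi-agent system for video understanding. Instead of using a single or fixed policy, VideoChat-M1 adopts a Collaborative Policy Planning (CPP) paradigm with multiple policy agents. The authors show that VideoChat-M1 achieves SOTA performance on eight benchmarks spanning four tasks.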
Reference score (independently computed attention score): 30.278740496355507
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.
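Abstract (reference translation): By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static, non-learnable tool invocation mechanisms, which limits the discovery of the diverse clues essential for robust perception and reasoning over temporally or spatially complex videos. To address this challenge, we propose VideoChat-M1, a novel multi-agent system for video understanding. Instead of using a single or fixed policy, VideoChat-M1 adopts a Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: each agent generates its own tool invocation policy tailored to the user's query. (2) Policy Execution: each agent sequentially invokes the relevant tools to execute its policy and explore the video content. (3) Policy Communication: during the intermediate stages of policy execution, agents interact with one another and update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from their peers in order to respond effectively to the user's query. Furthermore, we equip the CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. As a result, the team of policy agents can be jointly optimized to improve VideoChat-M1's performance, guided by both the final answer reward and feedback on the intermediate collaborative process. Extensive experiments show that VideoChat-M1 achieves SOTA performance on eight benchmarks spanning four tasks. In particular, on LongVideoBench, it outperforms the SOTA model Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6%.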
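
The three CPP processes named in the abstract (Policy Generation, Policy Execution, Policy Communication) can be illustrated with a toy control loop. This is only a rough sketch under stated assumptions and not the authors' implementation: the tool names, the `PolicyAgent` class, the `collaborative_policy_planning` function, and the rule "adopt any tool a peer has already used" are all hypothetical stand-ins, and the MARL training described in the abstract is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tools: each maps a query to a textual clue about the video.
TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": lambda q: f"caption clue for '{q}'",
    "ocr": lambda q: f"ocr clue for '{q}'",
    "grounding": lambda q: f"grounding clue for '{q}'",
}


@dataclass
class PolicyAgent:
    name: str
    preferred: List[str]  # assumption: stands in for a learned policy
    plan: List[str] = field(default_factory=list)
    used: List[str] = field(default_factory=list)
    clues: List[str] = field(default_factory=list)

    def generate_policy(self, query: str) -> None:
        # (1) Policy Generation: a real agent would condition on the query;
        # this toy version just copies its preferred tool list.
        self.plan = list(self.preferred)

    def execute_step(self, query: str) -> None:
        # (2) Policy Execution: invoke the next tool in the plan, if any.
        if self.plan:
            tool_name = self.plan.pop(0)
            self.used.append(tool_name)
            self.clues.append(TOOLS[tool_name](query))

    def update_policy(self, peer_used: Dict[str, List[str]]) -> None:
        # (3) Policy Communication: toy rule that appends any tool a peer
        # has used which this agent has neither used nor planned.
        for tools in peer_used.values():
            for tool_name in tools:
                if tool_name not in self.used and tool_name not in self.plan:
                    self.plan.append(tool_name)


def collaborative_policy_planning(
    query: str, agents: List[PolicyAgent], max_rounds: int = 4
) -> Dict[str, List[str]]:
    for agent in agents:
        agent.generate_policy(query)
    for _ in range(max_rounds):
        if not any(agent.plan for agent in agents):
            break  # every plan is exhausted
        for agent in agents:
            agent.execute_step(query)
        # Snapshot tool usage so updates within a round are order-independent.
        snapshot = {agent.name: list(agent.used) for agent in agents}
        for agent in agents:
            peers = {n: u for n, u in snapshot.items() if n != agent.name}
            agent.update_policy(peers)
    return {agent.name: agent.clues for agent in agents}


if __name__ == "__main__":
    team = [PolicyAgent("A", ["caption"]), PolicyAgent("B", ["ocr", "grounding"])]
    for name, clues in collaborative_policy_planning("What does the sign say?", team).items():
        print(name, clues)
```

In this toy setup each agent starts from a different plan and extends it mid-execution with tools its peers have used, which is one simple way to picture policies being refined "based on contextual insights from peers". How the actual system represents policies and exchanges information is not specified on this page.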

Related papers