Fugu-MT paper translation (abstract): EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Paper summary: EVA: Efficient Reinforcement Learning for End-to-End Video Agent

arxiv url: http://arxiv.org/abs/2603.22918v1
Date: Tue, 24 Mar 2026 08:06:29 GMT
Status: translation complete
Last updated in system: 2026-03-25 19:53:37.375379
Title: EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu
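Abstract summary: Video understanding with multimodal large language models (MLLMs) remains challenging because of the long token sequences of videos. The paper proposes EVA, an efficient reinforcement learning framework for end-to-end video agents. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding.
Reference score (attention score computed independently by this site): 28.603844837930225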
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline, comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO), that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods. Our code and model are available at https://github.com/wangruohui/EfficientVideoAgent.
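The planning-before-perception loop named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `WatchAction`, `AgentState`, `run_agent`, the frame budget, and the toy `plan`/`perceive`/`reflect` callables are all hypothetical names chosen here to show the control flow, in which the agent picks a time span and sampling density before any frame is read, then folds the observation into a running summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class WatchAction:
    """Hypothetical action: which span to watch and how many frames to sample."""
    start_s: float
    end_s: float
    num_frames: int


@dataclass
class AgentState:
    """Hypothetical agent state carried across rounds."""
    query: str
    duration_s: float
    summary: str = ""  # running summary of the evidence gathered so far
    frames_used: int = 0
    history: List[WatchAction] = field(default_factory=list)


def sample_timestamps(action: WatchAction) -> List[float]:
    """Evenly spaced timestamps: the midpoint of each equal-width bin in the span."""
    if action.num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = (action.end_s - action.start_s) / action.num_frames
    return [action.start_s + step * (i + 0.5) for i in range(action.num_frames)]


def run_agent(
    state: AgentState,
    plan: Callable[[AgentState], Optional[WatchAction]],
    perceive: Callable[[List[float]], str],
    reflect: Callable[[AgentState, str], Tuple[str, Optional[str]]],
    max_rounds: int = 4,
    frame_budget: int = 64,
) -> Optional[str]:
    """Iterate plan -> action -> reflection, updating the summary each round."""
    for _ in range(max_rounds):
        action = plan(state)  # planning happens before any frame is read
        if action is None or state.frames_used + action.num_frames > frame_budget:
            break
        observation = perceive(sample_timestamps(action))
        state.frames_used += action.num_frames
        state.history.append(action)
        state.summary, answer = reflect(state, observation)
        if answer is not None:
            return answer
    return None


# Toy stand-ins for the model calls (assumptions, for illustration only).
def toy_plan(state: AgentState) -> Optional[WatchAction]:
    if not state.history:
        return WatchAction(0.0, state.duration_s, 4)  # coarse pass over the video
    last = state.history[-1]
    return WatchAction(last.start_s, (last.start_s + last.end_s) / 2, 4)  # zoom in


def toy_perceive(timestamps: List[float]) -> str:
    return f"saw {len(timestamps)} frames"


def toy_reflect(state: AgentState, observation: str) -> Tuple[str, Optional[str]]:
    summary = (state.summary + " | " + observation).strip(" |")
    return summary, ("done" if len(state.history) >= 2 else None)
```

With the toy callables, `run_agent(AgentState("q", 120.0), toy_plan, toy_perceive, toy_reflect)` takes one coarse pass over the whole video and one denser pass over its first half before answering. In a trained agent, the three callables would be decisions made by the model itself rather than fixed rules.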

Related papers