Fugu-MT Paper Translation (Abstract): EasyVideoR1: Easier RL for Video Understanding

Paper overview: EasyVideoR1: Easier RL for Video Understanding

arxiv url: http://arxiv.org/abs/2604.16893v1
Date: Sat, 18 Apr 2026 07:56:32 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.222759
Title: EasyVideoR1: Easier RL for Video Understanding
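Title (reference translation): EasyVideoR1: RL That Makes Video Understanding Easier
Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang
Abstract summary: Reinforcement learning from verifiable rewards (RLVR) has shown remarkable effectiveness in improving the reasoning capabilities of large language models. EasyVideoR1 is a complete and efficient reinforcement learning framework designed specifically for training large vision-language models on video understanding tasks.
Reference score (independently computed attention measure): 51.760544033045726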
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored for the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
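Abstract (reference translation): Reinforcement learning from verifiable rewards (RLVR) has shown remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important, yet it remains largely unexplored because of the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across many sensitive hyperparameters. Existing open-source RL training frameworks provide a solid foundation for text and image scenarios but lack systematic optimizations suited to the video modality. In this work, we propose EasyVideoR1, a complete and efficient reinforcement learning framework for training large vision-language models on video understanding tasks. EasyVideoR1 provides: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and improves throughput by 1.47×; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 video benchmarks.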
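
The abstract describes a task-aware reward system with unified routing and modular extension but does not show its interface. A minimal sketch of the general idea, assuming a registry keyed by task type, might look like the following. All names here (`REWARD_REGISTRY`, `register_reward`, `compute_reward`, and the two task types) are hypothetical illustrations and are not taken from EasyVideoR1.

```python
from typing import Callable, Dict

# Hypothetical signature: (model prediction, reference answer) -> scalar reward.
RewardFn = Callable[[str, str], float]

# Hypothetical registry mapping a task-type string to its reward function.
REWARD_REGISTRY: Dict[str, RewardFn] = {}


def register_reward(task_type: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that registers a reward function under a task type."""
    def decorator(fn: RewardFn) -> RewardFn:
        REWARD_REGISTRY[task_type] = fn
        return fn
    return decorator


@register_reward("multiple_choice")
def multiple_choice_reward(prediction: str, answer: str) -> float:
    # Exact match on the option letter, ignoring case and surrounding whitespace.
    return 1.0 if prediction.strip().upper() == answer.strip().upper() else 0.0


@register_reward("numerical")
def numerical_reward(prediction: str, answer: str) -> float:
    # Correct if both strings parse as numbers within 1% relative tolerance
    # (the tolerance is an arbitrary choice for this sketch).
    try:
        pred, gold = float(prediction), float(answer)
    except ValueError:
        return 0.0
    return 1.0 if abs(pred - gold) <= 1e-2 * max(1.0, abs(gold)) else 0.0


def compute_reward(task_type: str, prediction: str, answer: str) -> float:
    """Route a sample to the reward function registered for its task type."""
    if task_type not in REWARD_REGISTRY:
        raise ValueError(f"no reward registered for task type {task_type!r}")
    return REWARD_REGISTRY[task_type](prediction, answer)
```

In this sketch, supporting a new problem type means adding one decorated function, and the routing code in `compute_reward` stays unchanged. That is one plausible reading of "modular extension", not a description of the framework's actual implementation.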
