Fugu-MT Paper Translation (Abstract): CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Paper overview: CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

arxiv url: http://arxiv.org/abs/2606.19927v1
Date: Thu, 18 Jun 2026 08:28:26 GMT
Status: Translation complete
System update date: 2026-06-19 18:23:39.72838
Title: CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs
Authors: Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua
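Abstract summary: We propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. CARE maintains a smoothed competence estimate via an exponential moving average of pass rates and uses it to route training into progressive stages. Experiments on multiple video reasoning and general video understanding benchmarks show that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly improves token efficiency.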
Reference score (independently computed attention score): 50.189987475377656
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
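The abstract names three mechanisms: an exponential moving average of pass rates, routing into progressive stages, and normalization of reasoning effort with batch-level statistics. The following is a minimal Python sketch of how such pieces could fit together. It is not the authors' implementation. The abstract gives no formulas, so everything concrete here is a hypothetical choice made for illustration: the class and function names, the decay factor, the stage thresholds and stage names, the length weights, and the additive reward form. The posterior amplifier is omitted because the abstract does not describe it in enough detail to sketch.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class CompetenceTracker:
    """Smoothed competence estimate: an EMA over per-batch pass rates."""
    decay: float = 0.9       # hypothetical smoothing factor
    competence: float = 0.0  # hypothetical initial estimate

    def update(self, pass_rate: float) -> float:
        # Standard EMA update: keep `decay` of the old value, blend in the new one.
        self.competence = self.decay * self.competence + (1.0 - self.decay) * pass_rate
        return self.competence


def route_stage(competence: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map competence to a training stage (thresholds and names are assumptions)."""
    if competence < low:
        return "explore"
    if competence < high:
        return "balance"
    return "compress"


def normalize_lengths(lengths: list[int]) -> list[float]:
    """Z-score reasoning lengths within one batch."""
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        return [0.0] * len(lengths)
    return [(n - mu) / sigma for n in lengths]


# Hypothetical weights: favor longer traces early, shorter traces late.
LENGTH_WEIGHT = {"explore": 0.1, "balance": 0.0, "compress": -0.1}


def shaped_rewards(correct: list[int], lengths: list[int], stage: str) -> list[float]:
    """Correctness reward plus a stage-dependent term on normalized length."""
    weight = LENGTH_WEIGHT[stage]
    z_scores = normalize_lengths(lengths)
    return [float(c) + weight * z for c, z in zip(correct, z_scores)]


if __name__ == "__main__":
    tracker = CompetenceTracker()
    for pass_rate in (0.2, 0.5, 0.9):
        stage = route_stage(tracker.update(pass_rate))
    print(stage, shaped_rewards([1, 0, 1], [120, 300, 80], stage))
```

In a GRPO-style loop, a tracker like this would be updated once per batch from the fraction of correct rollouts, and the shaped rewards would replace the raw correctness rewards before advantages are computed. Because the shaping happens only in the training reward, it is consistent with the abstract's claim of no additional inference-time overhead.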


Related papers