Fugu-MT Paper Translation (Summary): All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Paper summary: All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

arxiv url: http://arxiv.org/abs/2604.00479v1
Date: Wed, 01 Apr 2026 04:52:21 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.83757
Title: All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang
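Abstract summary: Reinforcement Learning (RL) enhances the reasoning capabilities of Vision-Language Models (VLMs). The paper shows that GRPO is prone to diversity collapse, with models converging prematurely to a limited subset of reasoning strategies. Multi-Group Policy Optimization (MUPO) is a simple yet effective approach designed to incentivize divergent thinking across multiple solutions.
Reference score (independently computed attention score): 19.093820590411266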
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the underlying mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models: the former engage in deeper yet narrower reasoning, while base models, despite being less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/

Related papers