Fugu-MT Paper Translation (Abstract): Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Paper summary: Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

arxiv url: http://arxiv.org/abs/2603.22847v1
Date: Tue, 24 Mar 2026 06:38:00 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.336295
Title: Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
Authors: Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng
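Abstract summary: Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories. Existing Reinforcement Learning with Verifiable Rewards (RLVR) methods treat CoT uniformly, without distinguishing its varying degrees of visual grounding. This paper proposes Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy.
Reference score (independently computed attention score): 73.39221516441624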
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
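The abstract describes PEPO only at a high level: a perception prior from hidden state similarity, combined with token entropy through a smooth gate, yields token-level advantages. The following is a minimal illustrative sketch of that idea, not the authors' implementation. Every concrete choice here is an assumption made for illustration: the function names, the use of maximum cosine similarity to image-token hidden states as the prior, min-max normalization, a sigmoid as the gate, the `temperature` parameter, and rescaling so that the token weights average to 1. The released code at the repository linked above is the authoritative reference.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


def token_level_advantages(seq_advantage, text_hidden, image_hidden,
                           token_probs, temperature=1.0):
    """Spread one sequence-level advantage over response tokens.

    Hypothetical formulation (an assumption, not taken from the paper):
      - perception prior: max cosine similarity between a response token's
        hidden state and any image-token hidden state
      - exploration signal: entropy of the token's predictive distribution
      - smooth gate: a sigmoid of the perception prior mixes the two signals
    """
    perception = min_max(
        [max(cosine(h, v) for v in image_hidden) for h in text_hidden]
    )
    entropy = min_max([token_entropy(p) for p in token_probs])

    weights = []
    for p, e in zip(perception, entropy):
        gate = 1.0 / (1.0 + math.exp(-(p - 0.5) / temperature))
        weights.append(gate * p + (1.0 - gate) * e)

    # Rescale so the weights average to 1, which keeps the mean token
    # advantage equal to the original sequence-level advantage.
    mean_weight = sum(weights) / len(weights)
    return [seq_advantage * w / mean_weight for w in weights]


# Toy example: three response tokens, one image token, 2-d hidden states.
advantages = token_level_advantages(
    seq_advantage=1.0,
    text_hidden=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    image_hidden=[[1.0, 0.0]],
    token_probs=[[0.5, 0.5], [1.0, 0.0], [0.9, 0.1]],
)
print(advantages)
```

In this toy setup, the first token is both the most visually aligned and the most uncertain, so it receives the largest share of the advantage, while the second token (orthogonal to the image token and fully confident) receives none. In a GRPO- or DAPO-style trainer, such per-token values would take the place of the single sequence-level advantage broadcast across all tokens; how PEPO actually does this is specified in the paper and its code, not here.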

Related papers