Fugu-MT Paper Translation (Abstract): Spotlight on Token Perception for Multimodal Reinforcement Learning

Paper overview: Spotlight on Token Perception for Multimodal Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.09285v1
Date: Fri, 10 Oct 2025 11:25:33 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:48.799158
Title: Spotlight on Token Perception for Multimodal Reinforcement Learning
Title (reference translation): A Spotlight on Token Perception for Multimodal Reinforcement Learning
Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
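Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning capabilities of Large Vision-Language Models (LVLMs). This paper undertakes a pioneering exploration of multimodal RLVR through the novel perspective of token perception. It proposes Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal.
Reference score (independently computed attention score): 65.97597482517425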
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
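Abstract (reference translation): Although Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods neglect the critical role of visual perception in the RLVR optimization process. This paper undertakes a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. A fine-grained analysis of Chain-of-Thought (CoT) processes reveals two key insights: first, token perception in a rollout trajectory is sparsely distributed, with only a small number of tokens having high visual dependency for visually grounded reasoning; second, different trajectories diverge significantly in their overall visual dependency. Based on these observations, the paper proposes Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through two mechanisms: it reweights a trajectory's advantage by its overall visual dependency, and it concentrates policy updates only on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO shows substantial gains over leading open-source RL-tuned models, and its effectiveness is consistently validated at the 7B and 32B model scales. The work not only establishes a new token-level perceptual perspective for analyzing multimodal RLVR but also presents a novel and effective optimization strategy that significantly enhances the multimodal reasoning capabilities of LVLMs.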
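
The dual mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how token perception is computed, how a trajectory's overall visual dependency is aggregated, or how many tokens count as pivotal. The sketch therefore takes per-token dependency scores as given inputs and assumes a mean aggregation, a normalisation by the group mean, and a top-fraction cutoff. The function name shape_advantages and the parameter top_fraction are hypothetical.

```python
from typing import List


def shape_advantages(
    dependencies: List[List[float]],  # per-trajectory, per-token visual dependency (given, not computed here)
    advantages: List[float],          # one scalar advantage per trajectory
    top_fraction: float = 0.5,        # hypothetical share of tokens treated as pivotal
) -> List[List[float]]:
    """Return per-token advantages after trajectory reweighting and token masking."""
    # Assumption: a trajectory's overall dependency is the mean of its token scores.
    traj_dep = [sum(d) / len(d) for d in dependencies]
    mean_dep = sum(traj_dep) / len(traj_dep)

    shaped = []
    for deps, adv, dep in zip(dependencies, advantages, traj_dep):
        # Assumption: the reweighting factor is the dependency relative to the group mean.
        weight = dep / mean_dep if mean_dep > 0 else 1.0

        # Keep only the k tokens with the highest visual dependency.
        k = max(1, int(top_fraction * len(deps)))
        ranked = sorted(range(len(deps)), key=lambda i: deps[i], reverse=True)
        pivotal = set(ranked[:k])

        # Non-pivotal tokens get zero advantage, so they contribute no gradient.
        shaped.append([weight * adv if i in pivotal else 0.0 for i in range(len(deps))])
    return shaped


# Two toy trajectories of four tokens each.
shaped = shape_advantages(
    dependencies=[[4.0, 0.0, 0.0, 2.0], [1.0, 0.0, 1.0, 0.0]],
    advantages=[2.0, -2.0],
)
print(shaped)  # [[3.0, 0.0, 0.0, 3.0], [-1.0, 0.0, -1.0, 0.0]]
```

In the toy example the first trajectory depends more on the image than the second, so its advantage is scaled up (2.0 becomes 3.0) while the second is scaled down (-2.0 becomes -1.0), and in both cases only the two most visually dependent tokens carry a nonzero learning signal.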

Related papers