Fugu-MT Paper Translation (Abstract): Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Paper summary: Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

arxiv url: http://arxiv.org/abs/2604.08476v1
Date: Thu, 09 Apr 2026 17:15:47 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:06.042796
Title: Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
Title (reference translation): Faithful GRPO: Improving visual spatial reasoning in multimodal language models through constrained policy optimization
Authors: Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu
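Abstract summary: Chain-of-Thought traces are often inconsistent with the final answer and poorly grounded in the visual evidence. This paper proposes Faithful GRPO (FGRPO), which enforces consistency and grounding as constraints via Lagrangian dual ascent. The results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%.
Reference score (attention score computed independently by the site): 31.411469692692766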
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
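Abstract (reference translation): Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, the generated Chain-of-Thought (CoT) traces are often inconsistent with the final answer and poorly grounded in the visual evidence. The authors study this phenomenon systematically on seven real-world spatial reasoning benchmarks and show that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as their own models trained with standard Group Relative Policy Optimization (GRPO). They characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe the objects, attributes, and spatial relationships in the image?). To address the problem, they propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group and adaptively adjusts the relative importance of the constraints during optimization. FGRPO is evaluated on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. The results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over plain GRPO, indicating that faithful reasoning enables better answers.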
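
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of how Lagrangian dual ascent could fold constraint scores into group-relative advantages. It is not the authors' implementation. The function names, the additive way the terms are combined, the thresholds, and the multiplier learning rate are all assumptions made for illustration.

```python
import numpy as np


def group_normalize(x, eps=1e-6):
    """Standardize scores within each group (row), as GRPO does for rewards."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)


def constrained_advantages(task_reward, constraint_scores, lambdas):
    """Add multiplier-weighted constraint advantages to the task advantage.

    task_reward:       array of shape (groups, samples_per_group)
    constraint_scores: dict name -> array of the same shape, values in [0, 1]
    lambdas:           dict name -> nonnegative Lagrange multiplier
    The additive combination is an assumption, not taken from the paper.
    """
    adv = group_normalize(task_reward)
    for name, scores in constraint_scores.items():
        adv = adv + lambdas[name] * group_normalize(scores)
    return adv


def dual_ascent_step(lambdas, constraint_scores, thresholds, lr=0.1):
    """Projected dual ascent on the multipliers.

    A multiplier grows when the batch-mean score is below its (hypothetical)
    threshold, shrinks when the constraint is satisfied, and never goes
    below zero.
    """
    updated = {}
    for name, scores in constraint_scores.items():
        violation = thresholds[name] - float(scores.mean())
        updated[name] = max(0.0, lambdas[name] + lr * violation)
    return updated


if __name__ == "__main__":
    # One group of four sampled responses (toy numbers, not from the paper).
    rewards = np.array([[1.0, 0.0, 1.0, 0.0]])
    scores = {
        "consistency": np.array([[1.0, 1.0, 0.0, 0.0]]),
        "grounding": np.array([[0.9, 0.8, 0.4, 0.5]]),
    }
    lambdas = {"consistency": 0.0, "grounding": 0.0}
    thresholds = {"consistency": 0.95, "grounding": 0.6}

    lambdas = dual_ascent_step(lambdas, scores, thresholds)
    print(lambdas)
    print(constrained_advantages(rewards, scores, lambdas))
```

In this toy batch the consistency constraint is violated, so its multiplier becomes positive and a correct, consistent response receives a larger advantage than a correct but inconsistent one. The grounding constraint is already satisfied, so its multiplier stays at zero. This is one way to read "adaptively adjusting the relative importance of constraints."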

Related papers