Fugu-MT Paper Translation (Abstract): Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

Paper overview: Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

arxiv url: http://arxiv.org/abs/2605.14054v1
Date: Wed, 13 May 2026 19:23:53 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.476934
Title: Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
Authors: Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin
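Abstract summary: We argue that the root cause of the perception-reasoning trade-off is an ambiguity in modality credit assignment. We propose a reinforcement learning framework that improves perception-reasoning synergy.
Reference score (independently computed attention score): 45.319525299206866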
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
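The routing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` type, the `toy_blind_answerer` stub, the exact-match checks, and the reward values of +1, 0, and -1 are all assumptions made for illustration. The sketch assumes that a response is already split into perception and reasoning steps, and that a text-only "blindfolded" answerer can be asked the question using the perception text alone.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # "perception" or "reasoning" (hypothetical labels)
    text: str


def toy_blind_answerer(question: str, percept: str) -> str:
    """Stand-in for a text-only reasoner: returns the first integer in the percept."""
    match = re.search(r"\d+", percept)
    return match.group() if match else ""


def perception_verified(
    steps: List[Step],
    question: str,
    reference: str,
    blind_answerer: Callable[[str, str], str],
) -> bool:
    """Blindfolded proxy: can the question be answered from the perception text alone?"""
    percept = " ".join(s.text for s in steps if s.kind == "perception")
    return blind_answerer(question, percept) == reference


def route_rewards(
    steps: List[Step],
    final_answer: str,
    question: str,
    reference: str,
    blind_answerer: Callable[[str, str], str] = toy_blind_answerer,
) -> List[float]:
    """Return one reward per step, routed by modality (assumed reward scheme)."""
    seen_ok = perception_verified(steps, question, reference, blind_answerer)
    answer_ok = final_answer == reference
    # Perception is rewarded on its own fidelity, independent of the final answer.
    perception_reward = 1.0 if seen_ok else -1.0
    if answer_ok:
        reasoning_reward = 1.0
    elif seen_ok:
        reasoning_reward = -1.0  # faithful perception, wrong answer: bad thinking
    else:
        reasoning_reward = 0.0  # unfaithful perception: do not blame the reasoning
    return [
        perception_reward if s.kind == "perception" else reasoning_reward
        for s in steps
    ]


question, reference = "How many apples are in the image?", "3"
bad_thinking = [
    Step("perception", "There are 3 apples on the table."),
    Step("reasoning", "So the answer is 2."),
]
bad_seeing = [
    Step("perception", "There are 2 apples on the table."),
    Step("reasoning", "So the answer is 2."),
]
thinking_rewards = route_rewards(bad_thinking, "2", question, reference)
seeing_rewards = route_rewards(bad_seeing, "2", question, reference)
```

In the first toy response the perception text supports the correct answer, so only the reasoning step is penalized. In the second, the perception text itself is wrong, so the penalty goes to the perception step and the reasoning step is left neutral.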

Related papers