Fugu-MT Paper Translation (Abstract): From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Paper overview: From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.20177v1
Date: Tue, 19 May 2026 17:58:40 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.577873
Title: From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
Authors: Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou
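Abstract summary: This work studies the interplay between perception and reasoning in vision-language models (VLMs) by decomposing post-training into three separate training stages. Models trained with the proposed approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces.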
Reference score (independently computed attention score): 66.95781712577315
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.
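The contrast between staged and merged training described in the abstract can be illustrated with a minimal data-scheduling sketch. This is not the authors' implementation: the function names, the string-valued examples, and the stage order are assumptions made for illustration. The abstract states only that visual perception should be solidified before visual reasoning; placing textual reasoning last here is a guess, so the order is left as a parameter.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple

# Stage names follow the abstract; the ORDER is an assumption
# (the abstract only says perception comes before visual reasoning).
STAGES: Tuple[str, ...] = (
    "visual_perception",
    "visual_reasoning",
    "textual_reasoning",
)


def staged_schedule(
    pools: Dict[str, Sequence[str]],
    order: Sequence[str] = STAGES,
) -> Iterator[Tuple[str, str]]:
    """Yield (stage, example) pairs, finishing one capability pool
    completely before moving on to the next stage."""
    for stage in order:
        for example in pools[stage]:
            yield stage, example


def merged_schedule(
    pools: Dict[str, Sequence[str]],
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return the same (stage, example) pairs, but shuffled into a
    single mixed stream, as in merged training."""
    mixed = [(stage, ex) for stage in pools for ex in pools[stage]]
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Hypothetical toy data pools, one per capability.
    toy_pools = {
        "visual_perception": ["count objects", "read chart axis"],
        "visual_reasoning": ["solve geometry figure"],
        "textual_reasoning": ["word problem", "logic puzzle"],
    }
    for stage, example in staged_schedule(toy_pools):
        print(f"[staged] {stage}: {example}")
    for stage, example in merged_schedule(toy_pools):
        print(f"[merged] {stage}: {example}")
```

Both schedules visit exactly the same examples; only the ordering differs, which is the variable the abstract's staged-versus-merged comparison isolates. A difficulty-based curriculum, which the abstract describes as orthogonal, could in principle be layered on top by sorting examples within each stage.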


Related papers