Fugu-MT Paper Translation (Summary): Unleashing Perception-Time Scaling to Multimodal Reasoning Models

Paper summary: Unleashing Perception-Time Scaling to Multimodal Reasoning Models

arxiv url: http://arxiv.org/abs/2510.08964v1
Date: Fri, 10 Oct 2025 03:17:52 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:48.033101
Title: Unleashing Perception-Time Scaling to Multimodal Reasoning Models
Title (reference translation): Unleashing Perception-Time Scaling for Multimodal Reasoning Models
Authors: Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu
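Abstract summary: Recent advances in inference-time scaling have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. This paper proposes Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems.
Reference score (attention score computed independently by this site): 60.578179197783754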
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model's attention to image tokens. Our code and data will be publicly released.
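Abstract (reference translation): Recent progress in inference-time scaling, especially approaches that use reinforcement learning with verifiable rewards, has greatly improved the reasoning ability of Large Vision-Language Models (LVLMs). Following this success, similar strategies have been applied to multimodal reasoning, but their effect on visual perception is still unclear. To examine this gap, the authors introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. The evaluation shows that LVLMs have limited estimation precision and that inference-time scaling brings only marginal gains. This is attributed to the fast perception paradigm of current LVLMs, in which visual understanding is treated as a one-shot output without modeling the underlying perceptual process. The authors therefore propose Perception-Time Scaling (PTS), a new paradigm that encourages token-rich perception. Combined with reinforcement learning techniques, PTS greatly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, although the PTS data are purely synthetic, combining them with math reasoning data gives consistent gains on both reasoning and real-world perception benchmarks. Further analysis shows that PTS introduces more perception-related tokens and increases the model's attention to image tokens. The code and data will be publicly released.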

Related papers