Fugu-MT Paper Translation (Abstract): Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Paper overview: Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

arxiv url: http://arxiv.org/abs/2509.12132v1
Date: Mon, 15 Sep 2025 16:57:25 GMT
Status: Translation complete
System update date: 2025-09-16 17:26:23.413423
Title: Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Title (reference translation): Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Authors: Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang
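Abstract summary: Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs). The paper proposes a new visual reasoning model (VRM), Reflection-V, which improves visual reflection through reasoning data construction for cold-start and reward design for reinforcement learning (RL). Reflection-V achieves significant improvements on multiple visual reasoning benchmarks.
Reference score (independently calculated attention score): 21.588467647421865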
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. As a result, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.
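Abstract (reference translation): Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs) in order to train visual reasoning models (VRMs). Effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process against visual information. Quantitative analysis shows that current VRMs exhibit limited visual reflection, because their attention to visual information decreases rapidly as responses grow longer. To address this challenge, we propose a new VRM, Reflection-V, which improves visual reflection through reasoning data construction for cold-start and reward design for reinforcement learning. First, we construct vision-centered reasoning data using an agent that mediates between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Next, a reward model based on visual attention is used to encourage reasoning grounded in visual information. As a result, Reflection-V achieves significant improvements on multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, suggesting that its visual reflection capability is effectively enhanced.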
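
The abstract says that attention to visual information fades as responses get longer, and that a visual attention based reward is used during RL. It gives no formulas for either, so the following Python is only an illustrative sketch under stated assumptions, not the authors' method. The function names, the fixed-length attention rows, the half-versus-half comparison, and the 0.8 threshold are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def visual_attention_ratio(
    attention_rows: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> List[float]:
    """Share of each generated token's attention mass that lands on visual tokens.

    Assumption: attention_rows[t][j] is the weight generated token t places on
    context position j (rows padded to equal length), and is_visual[j] marks
    the positions that hold image tokens.
    """
    ratios = []
    for row in attention_rows:
        if len(row) != len(is_visual):
            raise ValueError("each attention row must match the mask length")
        total = sum(row)
        visual = sum(w for w, v in zip(row, is_visual) if v)
        ratios.append(visual / total if total > 0 else 0.0)
    return ratios


def late_to_early_ratio(ratios: Sequence[float]) -> float:
    """Mean visual attention in the second half of a response divided by the first half."""
    if len(ratios) < 2:
        raise ValueError("need at least two generated tokens")
    mid = len(ratios) // 2
    early = sum(ratios[:mid]) / mid
    late = sum(ratios[mid:]) / (len(ratios) - mid)
    return late / early if early > 0 else 0.0


def toy_visual_reward(ratios: Sequence[float], threshold: float = 0.8) -> float:
    """Hypothetical binary reward: 1.0 if visual attention is sustained, else 0.0."""
    return 1.0 if late_to_early_ratio(ratios) >= threshold else 0.0


if __name__ == "__main__":
    # Four generated tokens attending over four context positions;
    # the first two positions are assumed to be image tokens.
    rows = [
        [0.6, 0.2, 0.2, 0.0],
        [0.4, 0.2, 0.3, 0.1],
        [0.2, 0.1, 0.4, 0.3],
        [0.1, 0.1, 0.4, 0.4],
    ]
    mask = [True, True, False, False]
    ratios = visual_attention_ratio(rows, mask)
    print(ratios, late_to_early_ratio(ratios), toy_visual_reward(ratios))
```

In this toy input the visual share drops from token to token, so the late-to-early ratio falls below the assumed threshold and the reward is zero. A response whose visual share stays flat would receive the full reward under the same rule.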
