Fugu-MT Paper Translation (Abstract): See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

Paper overview: See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

arxiv url: http://arxiv.org/abs/2604.24339v1
Date: Mon, 27 Apr 2026 11:31:15 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.923773
Title: See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
Title (reference translation): See Further, Think Deeper: Improving the Reasoning Ability of VLMs with Low-Level Visual Cues and Reflection
Authors: Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang,
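Abstract summary: This paper proposes ForeSight, a unified multimodal interleaved reasoning framework for Vision-Language Models (VLMs). It introduces a set of low-level visual tools that integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. A mask-based visual feedback mechanism incorporates visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers.
Reference score (independently computed attention score): 9.296609051671487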
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework, ForeSight, which enables VLMs to See Further with low-level visual cues and Think Deeper with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
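Abstract (reference translation): Recent advances in vision-language models (VLMs) have benefited from reinforcement learning (RL) for enhanced reasoning. However, existing methods still face important limitations, including a lack of low-level visual information and of effective visual feedback. To address these problems, the paper proposes ForeSight, a unified multimodal interleaved reasoning framework that enables VLMs to "See Further" with low-level visual cues and "Think Deeper" with effective visual feedback. First, it introduces a set of low-level visual tools that integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism incorporates visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to decide autonomously on tool invocation and answer verification, with final answer accuracy as the reward signal. To evaluate the proposed framework, the authors construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results show that the ForeSight-7B model significantly outperforms other models of the same parameter scale and even surpasses the current SOTA closed-source models on certain metrics.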

Related papers