Fugu-MT Paper Translation (Abstract): InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

Paper overview: InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

arxiv url: http://arxiv.org/abs/2605.26520v1
Date: Tue, 26 May 2026 04:07:49 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.611726
Title: InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
Title (reference translation): InterSketch: An Interleaved Reasoning Model with Self-Correcting Visual Sketches and Stepwise Reward
Authors: Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu
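Abstract summary: Human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). The paper introduces InterSketch, an interleaved reasoning model that enhances VT-CoT capability via self-correcting and stepwise reward mechanisms. Experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.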
Reference score (independently computed attention score): 24.461407883853344
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.
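Abstract (reference translation): Vision-language models (VLMs) have shown multi-turn visual reasoning capabilities, but their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, which limits their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning interleaved with a visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model that improves VT-CoT capability through self-correction and a stepwise reward mechanism. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning on long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized, high-quality interleaved VT-CoT dataset and include a reflection mechanism that equips the model for multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, which even outperforms proprietary models such as Gemini-3-Pro.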
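
The abstract contrasts end-only supervision, where the reward signal is sparse, with a stepwise reward. The abstract does not specify the paper's actual reward design, so the following is only a minimal sketch of the general idea. The function names, the per-step scores in [0, 1], the `step_weight` parameter, and the discount factor are all hypothetical assumptions for illustration.

```python
from typing import List


def end_only_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """Sparse scheme: only the last step carries a reward signal."""
    rewards = [0.0] * num_steps
    if num_steps > 0:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def stepwise_rewards(
    step_scores: List[float],
    final_correct: bool,
    step_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> List[float]:
    """Dense scheme: each step gets a weighted score; the outcome reward is added at the end."""
    rewards = [step_weight * score for score in step_scores]
    if rewards:
        rewards[-1] += 1.0 if final_correct else 0.0
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Standard discounted return per step: G_t = r_t + gamma * G_{t+1}."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # A trajectory whose final answer is wrong but whose first and third steps were good.
    scores = [1.0, 0.0, 1.0]  # hypothetical per-step quality scores
    print(end_only_rewards(len(scores), final_correct=False))
    print(stepwise_rewards(scores, final_correct=False))
```

Under these assumptions, a trajectory with a wrong final answer receives all-zero rewards in the end-only scheme, while the stepwise scheme still distinguishes its good intermediate steps from its bad ones.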

Related papers