Fugu-MT Paper Translation (Abstract): Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

Paper abstract: Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

arxiv url: http://arxiv.org/abs/2604.07518v1
Date: Wed, 08 Apr 2026 18:52:27 GMT
Status: Translation complete
System update date: 2026-04-10 18:34:05.518085
Title: Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
Title (reference translation): Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
Authors: Mengdan Zhu, Senhao Cheng, Liang Zhao
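Abstract summary: Vision-language models often struggle with complex visual reasoning because textual CoT loses visual information. We propose "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework. In experiments on vision-centric benchmarks, DLR consistently outperforms strong baselines.
Reference score (independently computed attention measure): 6.111899371682025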
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.
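Abstract (reference translation): Vision-language models often struggle with complex visual reasoning because of the visual information lost in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient for extracting semantics in multi-step reasoning. The DLR framework dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. The paper introduces a three-stage training pipeline and proposes a Spherical Gaussian Latent Policy. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing stepwise interpretability.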

Related papers