Fugu-MT Paper Translation (Abstract): LanteRn: Latent Visual Structured Reasoning

Paper overview: LanteRn: Latent Visual Structured Reasoning

arxiv url: http://arxiv.org/abs/2603.25629v1
Date: Thu, 26 Mar 2026 16:41:59 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.383736
Title: LanteRn: Latent Visual Structured Reasoning
Authors: André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins
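Abstract summary: This paper introduces LanteRn, a framework that allows visual reasoning to be performed directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink).
Reference score (independently computed attention score): 7.141402207573525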
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.

Related papers