Fugu-MT Paper Translation (Abstract): SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

Paper overview: SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

arxiv url: http://arxiv.org/abs/2604.17385v1
Date: Sun, 19 Apr 2026 11:21:59 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.497824
Title: SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
Authors: Yian Li, Yang Jiao, Bin Zhu, Tianwen Qian, Shaoxiang Chen, Jingjing Chen, Yu-Gang Jiang
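Abstract summary: Spatial intelligence, the ability to reason about geometric and physical structure from visual observations, remains a key challenge for multimodal large language models. The paper proposes a unified multimodal generation framework that combines textual reasoning with visual imagination. The framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and visual imagination for geometry-sensitive state transformation and consistency preservation.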
Reference score (independently computed attention level): 67.67774742200626
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spatial intelligence, which refers to the ability to reason about geometric and physical structure from visual observations, remains a core challenge for multimodal large language models. Despite promising performance, recent multimodal large language models (MLLMs) often exhibit fragile reasoning traces in spatial intelligence tasks that involve consistent spatial state recognition. We argue that these failures stem from a mismatch between the spatial recognition mechanism and the text-only reasoning behavior of these MLLMs. Effective spatial reasoning requires low-level geometric structure to be faithfully preserved and updated throughout the reasoning process, whereas textual representations tend to abstract away precisely these critical details. To address this issue, we propose SpatialImaginer, a unified multimodal generation framework that integrates textual reasoning with visual imagination. Our framework adopts a divide-and-conquer strategy, using text chain-of-thought for high-level semantic planning and the visual imagination for geometry-sensitive state transformation and consistency preservation. To support this capability, we further introduce a difficulty-aware data engine with closed-loop verification to train the model to invoke visual imagination selectively when stable spatial state tracking is required. Extensive experiments on diverse spatial intelligence benchmarks show that SpatialImaginer achieves state-of-the-art performance and substantially improves robustness on complex multi-step spatial reasoning tasks.

Related papers