Fugu-MT Paper Translation (Abstract): LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

Paper abstract: LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2604.09712v1
Date: Wed, 08 Apr 2026 06:28:03 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:15.620799
Title: LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models
Title (reference translation): LAST: Leveraging Tools to Enhance Spatial Reasoning in Multimodal Large Language Models
Authors: Shi-Yu Tian, Zhi Zhou, Kun-Yang Yu, Ming Yang, Yang Chen, Ziqiao Shang, Lan-Zhe Guo, Yu-Feng Li
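Abstract summary: Spatial reasoning is a foundational capability that lets intelligent systems understand and interact with the physical world. We propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features LAST-Box, an interactive sandbox that abstracts heterogeneous tool invocations into atomic instructions. We also design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation.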
Reference score (independently computed attention score): 27.76634830542925
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial reasoning is a cornerstone capability for intelligent systems to perceive and interact with the physical world. However, multimodal large language models (MLLMs) frequently suffer from hallucinations and imprecision when parsing complex geometric layouts. As data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models presents a compelling alternative. Despite its promise, applying this paradigm to spatial reasoning is hindered by two key challenges: the difficulty of invoking heterogeneous, parameter-rich tools, and the challenge of understanding and effectively leveraging their diverse low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. LAST features an extensible interactive sandbox, termed LAST-Box, which abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, returning multimodal hints (e.g., annotated images and textual descriptions) that can be directly consumed by LLMs. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves around 20% performance gains over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.
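Abstract (reference translation): Spatial reasoning is a foundational capability that lets intelligent systems understand and interact with the physical world. However, multimodal large language models (MLLMs) often suffer from hallucinations and inaccuracy when parsing complex geometric layouts. Because data-driven scaling struggles to internalize structured geometric priors and spatial constraints, integrating mature, specialized vision models is an attractive option. Despite this promise, applying the paradigm to spatial reasoning is hindered by two important challenges: the difficulty of invoking heterogeneous, parameter-rich tools, and the difficulty of understanding and effectively using their varied low-level outputs (e.g., segmentation masks, depth maps) in high-level reasoning. To address these challenges, we propose LAST, a unified framework for tool-augmented spatial reasoning. Its extensible interactive sandbox, called LAST-Box, abstracts heterogeneous tool invocations into atomic instructions and reusable spatial skills, and returns multimodal hints (e.g., annotated images, textual descriptions) that LLMs can consume directly. We further design a three-stage progressive training strategy that guides models from understanding tool outputs to proficient and adaptive tool invocation. Experiments on four datasets show that LAST-7B achieves a performance gain of about 20% over its backbone and outperforms strong proprietary closed-source LLMs, substantially enhancing reasoning on complex spatial tasks.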
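
The abstract describes a sandbox that hides heterogeneous tool calls behind atomic instructions and hands back hints a language model can read. The following is only a minimal sketch of that general idea, not the authors' implementation: the names `Hint`, `ToolSandbox`, `compare_depth`, and `fake_depth` are hypothetical, and the depth values are hard-coded stand-ins for a real depth model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Hint:
    """A multimodal hint: a textual description plus an optional annotated image."""
    text: str
    annotated_image: Optional[bytes] = None  # placeholder for image data


class ToolSandbox:
    """Hypothetical registry mapping atomic instruction names to tool wrappers."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Hint]] = {}

    def register(self, name: str, fn: Callable[..., Hint]) -> None:
        # Each wrapper hides a tool's own parameters behind one uniform call.
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> Hint:
        # Unknown instructions return a textual hint instead of raising,
        # so the caller can recover within the same dialogue.
        if name not in self._tools:
            return Hint(text=f"unknown instruction: {name}")
        return self._tools[name](**kwargs)


def fake_depth(image: bytes, point_a: str, point_b: str) -> Hint:
    # Assumption: toy depths in meters replace the output of a real depth model.
    depths = {"chair": 2.0, "table": 3.5}
    nearer = min((point_a, point_b), key=lambda k: depths[k])
    return Hint(text=f"{nearer} is closer to the camera")


sandbox = ToolSandbox()
sandbox.register("compare_depth", fake_depth)
hint = sandbox.call("compare_depth", image=b"", point_a="chair", point_b="table")
print(hint.text)
```

In this sketch the model only needs to emit an instruction name and a few arguments; the raw low-level output (here, depth values) is summarized into text before it is returned.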
