Fugu-MT Paper Translation (Abstract): RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Paper abstract: RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

arxiv url: http://arxiv.org/abs/2603.15386v1
Date: Mon, 16 Mar 2026 15:02:24 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.518995
Title: RieMind: Geometry-Grounded Spatial Agent for Scene Understanding
Title (reference translation): RieMind: A Geometric Spatial Agent for Scene Understanding
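Authors: Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang
Abstract summary: Current approaches rely on end-to-end video understanding or on large-scale fine-tuning for spatial question answering. The authors propose an agentic framework for static 3D indoor scenes that grounds an LLM in an explicit 3D scene graph (3DSG). The agentic variant achieves substantial performance gains, with average improvements between 33% and 50%.
Reference score (independently computed attention score): 47.34079422330063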
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. The results we obtain on the static split of VSI-Bench provide an upper bound under ideal perceptual conditions on the spatial reasoning performance, and we find that it is significantly higher than previous works, by up to 16%, without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33% and 50%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning performance, and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
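Abstract (reference translation): Vision-language models (VLMs) have become the main paradigm for understanding indoor scenes, yet they still struggle with metric and spatial reasoning. Current approaches depend on end-to-end video understanding or on large-scale fine-tuning for spatial question answering, which inherently couples perception and reasoning. This paper examines whether separating perception from reasoning leads to better spatial reasoning. It proposes an agentic framework for static 3D indoor scenes that grounds an LLM in an explicit 3D scene graph (3DSG). Instead of ingesting video directly, each scene is represented as a persistent 3DSG built by a dedicated perception module. To isolate reasoning performance, the 3DSG is instantiated from ground-truth annotations. The agent interacts with the scene only through structured geometric tools that expose basic properties such as object dimensions, distances, poses, and spatial relationships. The results on the static split of VSI-Bench give an upper bound on spatial reasoning performance under ideal perceptual conditions, and that bound is up to 16% higher than prior work without task-specific fine-tuning. Compared with base VLMs, the agentic variant performs substantially better, with average improvements between 33% and 50%. These results indicate that explicit geometric grounding markedly improves spatial reasoning performance, and that structured representations are an attractive alternative to purely end-to-end visual reasoning.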
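
The abstract describes an agent that never sees pixels and instead queries a 3D scene graph through geometric tools. The following is a minimal sketch of that idea, not the authors' implementation: the class names (`SceneObject`, `SceneGraph`), the tool names (`get_dimensions`, `get_distance`, `closest_to`), the axis-aligned box representation, and the center-to-center distance convention are all assumptions made for illustration. The example scene values are invented.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class SceneObject:
    """One node of a hypothetical 3D scene graph, modelled as an axis-aligned box."""
    label: str
    center: tuple[float, float, float]  # metres, in an assumed scene frame
    size: tuple[float, float, float]    # extents along x, y, z in metres


class SceneGraph:
    """Toy stand-in for a 3DSG; an agent would call only the tool methods."""

    def __init__(self, objects):
        self._objects = {obj.label: obj for obj in objects}

    def get_dimensions(self, label):
        """Tool: return the (x, y, z) extents of one object."""
        return self._objects[label].size

    def get_distance(self, a, b):
        """Tool: Euclidean distance between object centers (an assumed convention)."""
        return math.dist(self._objects[a].center, self._objects[b].center)

    def closest_to(self, label, candidates):
        """Tool: which candidate object lies nearest to the reference object."""
        return min(candidates, key=lambda c: self.get_distance(label, c))


# Invented example scene standing in for ground-truth annotations.
scene = SceneGraph([
    SceneObject("sofa", (0.0, 0.0, 0.4), (2.0, 0.9, 0.8)),
    SceneObject("table", (3.0, 4.0, 0.4), (1.2, 0.6, 0.5)),
    SceneObject("lamp", (0.0, 1.0, 0.4), (0.3, 0.3, 1.5)),
])

print(scene.get_dimensions("sofa"))
print(scene.get_distance("sofa", "table"))
print(scene.closest_to("sofa", ["table", "lamp"]))
```

In this sketch a question such as "which object is closest to the sofa?" reduces to a tool call on stored geometry, so the language model only has to choose the tool and its arguments rather than estimate distances from images.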
