Fugu-MT Paper Translation (Abstract): Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Paper overview: Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.18002v1
Date: Wed, 18 Mar 2026 17:59:10 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.876479
Title: Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Title (reference translation): Loc3R-VLM: Language-Based Localization and 3D Reasoning Using Vision-Language Models
Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
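Abstract summary: Loc3R-VLM is a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction and explicit situation modeling. To ensure geometric consistency and metric-scale alignment, it leverages lightweight camera pose priors extracted from a pre-trained 3D foundation model.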
Reference score (independently computed attention level): 47.045362895601556
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
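Abstract (reference translation): Multimodal Large Language Models (MLLMs) have made remarkable progress in connecting vision and language, yet they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction, which builds a holistic representation of the scene structure, and explicit situation modeling, which anchors the egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm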


Related papers