Fugu-MT Paper Translation (Abstract): GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Paper overview: GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2603.16461v1
Date: Tue, 17 Mar 2026 12:43:48 GMT
Status: Translation complete
System updated: 2026-03-18 17:42:07.280187
Title: GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models
Title (reference translation): GAP-MLLM: Geometric Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models
Authors: Jiaxin Zhang, Junjun Jiang, Haijie Li, Youyu Chen, Kui Jiang, Dave Zhenyu Chen,
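Abstract summary: We argue that the performance gap between image-based methods and methods using explicit 3D data arises not from insufficient geometric priors but from a misalignment in the training paradigm. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision. We propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation.
Reference score (independently computed attention metric): 70.61152292499737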
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently enhances performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
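Abstract (reference translation): Multimodal Large Language Models (MLLMs) show exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Even when they leverage implicit geometric priors from 3D reconstruction models, image-based methods still show a notable performance gap relative to methods that use explicit 3D data. We argue that this gap arises not from insufficient geometric priors but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate the geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, which leads to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLM to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. We further design a multi-level progressive fusion module with a token-level gating mechanism, which enables adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments show that GAP-MLLM significantly improves geometric feature fusion and consistently improves performance on 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.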
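
The token-level gating idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the form of the gate, so the residual formula `out = sem + g * geo`, the single linear gate with a sigmoid, and the names `token_gate`, `gated_fuse`, `w`, and `b` are all assumptions made for illustration. The sketch also assumes the semantic and geometric features for each token already share the same dimension.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def token_gate(sem_tok, geo_tok, w, b):
    """Hypothetical scalar gate in (0, 1) for one token.

    The gate is a linear function of the concatenated semantic and
    geometric features, squashed by a sigmoid (an assumed design).
    """
    z = b + sum(wi * xi for wi, xi in zip(w, sem_tok + geo_tok))
    return sigmoid(z)


def gated_fuse(sem, geo, w, b):
    """Fuse features token by token: out = sem + g * geo.

    sem, geo: lists of per-token feature vectors with equal dimension d.
    w: gate weights of length 2 * d; b: gate bias.
    Returns the fused tokens and the gate value used for each token.
    """
    fused, gates = [], []
    for sem_tok, geo_tok in zip(sem, geo):
        g = token_gate(sem_tok, geo_tok, w, b)
        gates.append(g)
        fused.append([s + g * x for s, x in zip(sem_tok, geo_tok)])
    return fused, gates


# Two tokens with 2-dimensional features (toy values).
sem = [[1.0, 0.0], [0.0, 1.0]]
geo = [[0.5, 0.5], [2.0, -2.0]]

# Zero weights and zero bias give a gate of exactly 0.5 for every token.
fused, gates = gated_fuse(sem, geo, w=[0.0] * 4, b=0.0)
print(gates)
print(fused)
```

In this residual form, a gate near zero leaves the semantic token essentially unchanged, which is one simple way to add geometric information without overwriting semantic features. A learned gate would have trained `w` and `b` rather than the fixed toy values used here.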

Related papers