Fugu-MT Paper Translation (Abstract): Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding

Paper summary: Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding

arxiv url: http://arxiv.org/abs/2511.06908v1
Date: Mon, 10 Nov 2025 10:02:30 GMT
Status: Translation complete
System update date: 2025-11-11 21:18:45.196039
Title: Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding
Title (reference translation): Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding
Authors: Yuzhen Li, Min Liu, Zhaoyang Li, Yuan Bian, Xueping Wang, Erbo Zhai, Yaonan Wang
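Abstract summary: We propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). Notably, the method improves the challenging Far(Acc@0.5) scenario by +13.54%.
Reference score (independently calculated attention score): 42.41930714202838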
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Monocular 3D Visual Grounding (Mono3DVG) is an emerging task that locates 3D objects in RGB images using text descriptions with geometric cues. However, existing methods face two key limitations. First, they often over-rely on high-certainty keywords that explicitly identify the target object while neglecting critical spatial descriptions. Second, generalized textual features contain both 2D and 3D descriptive information, thereby capturing an additional dimension of details compared to singular 2D or 3D visual features. This characteristic leads to cross-dimensional interference when refining visual features under text guidance. To overcome these challenges, we propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). The CLIP-LCA dynamically masks high-certainty keywords while retaining low-certainty implicit spatial descriptions, thereby forcing the model to develop a deeper understanding of spatial relationships in captions for object localization. Meanwhile, the D2M decouples dimension-specific (2D/3D) textual features from generalized textual features to guide the corresponding visual features in the same dimension, which mitigates cross-dimensional interference by ensuring dimensionally consistent cross-modal interactions. Through comprehensive comparisons and ablation studies on the Mono3DRefer dataset, our method achieves state-of-the-art (SOTA) performance across all metrics. Notably, it improves the challenging Far(Acc@0.5) scenario by a significant +13.54%.
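Abstract (reference translation): Monocular 3D Visual Grounding (Mono3DVG) is an emerging task that locates 3D objects in RGB images using text descriptions with geometric cues. However, existing methods have two key limitations. First, they often over-rely on high-certainty keywords that explicitly identify the target object while neglecting critical spatial descriptions. Second, generalized textual features contain both 2D and 3D descriptive information, and so carry an additional dimension of detail compared with 2D-only or 3D-only visual features. This characteristic causes cross-dimensional interference when visual features are refined under text guidance. To overcome these challenges, we propose Mono3DVG-EnSD, a novel framework that integrates two key components: the CLIP-Guided Lexical Certainty Adapter (CLIP-LCA) and the Dimension-Decoupled Module (D2M). The CLIP-LCA dynamically masks high-certainty keywords while retaining low-certainty implicit spatial descriptions, pushing the model toward a deeper understanding of the spatial relationships in captions for object localization. Meanwhile, the D2M decouples dimension-specific (2D/3D) textual features from generalized textual features to guide the corresponding visual features in the same dimension. Through comprehensive comparisons and ablation studies on the Mono3DRefer dataset, the proposed method achieves state-of-the-art (SOTA) performance across all metrics. Notably, it improves the challenging Far(Acc@0.5) scenario by +13.54%.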
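The abstract describes the CLIP-LCA only at a high level: high-certainty keywords are masked and low-certainty spatial descriptions are kept. As a rough illustration of that idea, and not the authors' implementation, the sketch below masks caption tokens whose certainty score exceeds a threshold. The function name `mask_high_certainty`, the fixed threshold, the `[MASK]` placeholder, and the hand-written scores are all assumptions made for this example; the abstract does not say how the scores are computed or how the masking is scheduled.

```python
from typing import List, Sequence


def mask_high_certainty(
    tokens: Sequence[str],
    certainty: Sequence[float],
    threshold: float = 0.5,      # hypothetical cutoff, not from the paper
    mask_token: str = "[MASK]",  # hypothetical placeholder token
) -> List[str]:
    """Replace tokens whose certainty score is above `threshold`.

    Tokens at or below the threshold (for example spatial words such as
    "left" or "behind") are returned unchanged.
    """
    if len(tokens) != len(certainty):
        raise ValueError("tokens and certainty must have the same length")
    return [
        mask_token if score > threshold else token
        for token, score in zip(tokens, certainty)
    ]


# Hand-written placeholder scores; in the paper's setting these would
# presumably be derived with CLIP, which is not reproduced here.
caption = ["the", "red", "car", "on", "the", "left", "side"]
scores = [0.05, 0.70, 0.90, 0.10, 0.05, 0.20, 0.15]

masked_caption = mask_high_certainty(caption, scores)
print(" ".join(masked_caption))
```

With these placeholder scores, the object words "red" and "car" are masked while "on the left side" survives, so a model trained on the masked caption would have to lean on the spatial phrase to localize the target.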


Related papers