Fugu-MT Paper Translation (Abstract): UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

Paper summary: UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

arxiv url: http://arxiv.org/abs/2603.08131v1
Date: Mon, 09 Mar 2026 09:10:01 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.726243
Title: UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing
Title (reference translation): UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing
Authors: Jiaxi Zhang, Yunheng Wang, Wei Lu, Taowen Wang, Weisheng Xu, Shuning Zhang, Yixiao Feng, Yuetong Fang, Renjing Xu
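Abstract summary: 3D Visual Grounding (3DVG) is a foundational challenge in embodied AI, with applications in robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that can locate arbitrary objects in a given scene. This paper replaces the perception constrained by those pre-trained models with training-free visual and geometric reasoning, unlocking open-world 3DVG.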
Reference score (independently computed attention score): 21.246395901914376
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing that training-free reasoning generalizes robustly beyond curated benchmarks.
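Abstract (reference translation): Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that can locate arbitrary objects in a given scene. However, reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, which limits generalization to unseen spatial relationships and reduces robustness to out-of-distribution scenes. This paper replaces this constrained perception with training-free visual and geometric reasoning, unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. In experiments on ScanRefer and EmbodiedScan, UniGround achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan. UniGround is also evaluated in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing that training-free reasoning generalizes robustly beyond curated benchmarks.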
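
Note on the Acc@0.25/0.5 figures quoted in the abstract: these are conventionally read as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.25 or 0.5. The following is a minimal sketch of that metric, not the paper's evaluation code. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and a "greater than or equal" threshold; the names `iou_3d` and `acc_at_iou` are hypothetical, and the abstract does not state the exact protocol used on either benchmark.

```python
from typing import Sequence, Tuple

# Assumed box layout (not taken from the paper):
# (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(box[3] - box[0], 0.0)
    dy = max(box[4] - box[1], 0.0)
    dz = max(box[5] - box[2], 0.0)
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (assumed convention: >=)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou_3d(p, g) >= threshold)
    return hits / len(preds)


if __name__ == "__main__":
    # Toy data for illustration only; not results from the paper.
    a = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    b = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
    far = (5.0, 5.0, 5.0, 6.0, 6.0, 6.0)
    preds = [a, a, a]
    gts = [a, b, far]
    print("Acc@0.25:", acc_at_iou(preds, gts, 0.25))
    print("Acc@0.5:", acc_at_iou(preds, gts, 0.5))
```

Under this reading, a single prediction can count toward Acc@0.25 without counting toward Acc@0.5, which is why the second figure in a pair such as 46.1%/34.1% is never the larger one.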

Related papers