Fugu-MT Paper Translation (Summary): From Pixels to Concepts: Growing Rich 3D Semantic Scene Graph Forests utilizing Foundation Models

Paper summary: From Pixels to Concepts: Growing Rich 3D Semantic Scene Graph Forests utilizing Foundation Models

arxiv url: http://arxiv.org/abs/2606.23312v1
Date: Mon, 22 Jun 2026 13:26:17 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:40:36.735575
Title: From Pixels to Concepts: Growing Rich 3D Semantic Scene Graph Forests utilizing Foundation Models
Authors: David Oberacker, Meike Deitersen, Niklas Spielbauer, Tristan Schnell, Georg Heppner, Arne Roennau
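Abstract summary: Hierarchical 3D scene graphs integrate geometric, semantic, and relational data within a unified spatial framework. Current 3D scene graph approaches often restrict themselves to rigid structures of pre-determined relationship classes. This paper explores the potential of foundation models to build forests of 3D scene graphs with open semantic relationships.
Reference score (independently computed attention level): 4.137761255401348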
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Operating in complex real-world environments requires robots to understand their surroundings on a functional semantic level. This demands a detailed multi-layer world model capturing the complex relations within those surroundings. Hierarchical 3D scene graphs address this challenge by integrating geometric, semantic, and relational data within a unified spatial framework. However, current 3D scene graph approaches often restrict themselves to rigid structures of pre-determined relationship classes, mostly neglecting important semantic connections, like causal connections or environmental contexts. This paper explores the potential of foundation models to build forests of 3D scene graphs with open semantic relationships to improve scene understanding and robotic task execution. We propose a method where instance-specific concept-nodes and relationships are first identified by a VLM and then extended by an LLM, which infers broader, more abstract concept-nodes and relationships through reasoning. These object-nodes, concept-nodes, and relationships are then assembled into a forest of hierarchical 3D scene graphs, enhanced with concept-nodes to represent abstract concepts. Evaluations were conducted on the uHumans2 and ScanNet indoor datasets, validating the accuracy and relevance of the generated relationships. Downstream suitability of scene-graph forests for robotics applications is demonstrated in an open-vocabulary object-retrieval task utilizing both ScanNet data and a real-world indoor deployment using a Boston Dynamics Spot. This paper leverages foundation models to create more expressive, semantically deep 3D hierarchical scene graphs and demonstrates their potential to advance semantic and environmental understanding in robotics.
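The abstract describes object-nodes, concept-nodes, and open (free-text) relationships assembled into a graph that supports object retrieval. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the class names `Node` and `SceneGraphForest`, the method names, the `kind` labels, and the example objects and relations are all hypothetical, and the relationships are hand-written here rather than produced by a VLM or LLM. Geometry and hierarchy layers are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: either a detected object or an abstract concept."""
    name: str
    kind: str  # "object" or "concept" (labels assumed for this sketch)


@dataclass
class SceneGraphForest:
    """Holds nodes plus free-text (open-vocabulary) relationship edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def relate(self, source, relation, target):
        # Relations are arbitrary strings rather than a fixed class list.
        for endpoint in (source, target):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node: {endpoint}")
        self.edges.append((source, relation, target))

    def objects_for_concept(self, concept):
        """Return names of objects that reach `concept` through any chain of edges."""
        found, seen, stack = [], {concept}, [concept]
        while stack:
            current = stack.pop()
            for source, _, target in self.edges:
                if target == current and source not in seen:
                    seen.add(source)
                    stack.append(source)
                    if self.nodes[source].kind == "object":
                        found.append(source)
        return sorted(found)


forest = SceneGraphForest()
forest.add_node("mug", "object")
forest.add_node("kettle", "object")
forest.add_node("sofa", "object")
# Hypothetical instance-level concept (the role the abstract assigns to the VLM).
forest.add_node("making tea", "concept")
# Hypothetical broader concept (the role the abstract assigns to the LLM).
forest.add_node("food preparation", "concept")
forest.relate("mug", "used for", "making tea")
forest.relate("kettle", "used for", "making tea")
forest.relate("making tea", "is a kind of", "food preparation")

print(forest.objects_for_concept("food preparation"))  # ['kettle', 'mug']
```

Querying the broader concept returns the objects linked to it only indirectly, through the more specific concept, which is the kind of lookup an open-vocabulary object-retrieval task would rely on.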

Related papers