Fugu-MT Paper Translation (Summary): FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand

Paper summary: FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand

arxiv url: http://arxiv.org/abs/2605.25371v1
Date: Mon, 25 May 2026 02:52:34 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:19.259239
Title: FOUND-IT: Foundation-model-first Task-driven 3D Scene Graphs with Granularity on Demand
Title (reference translation): FOUND-IT: Foundation-model-first task-driven 3D scene graphs with on-demand granularity
Authors: Dominic Maggio, Nicolas Gorlo, Luca Carlone
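Abstract summary: This work proposes a method for building hierarchical task-driven 3D scene graphs in real time using a monocular camera. Geometric foundation models are used to estimate the geometric attributes of the scene graph. The approach is task-driven in the sense that it adjusts the granularity of objects and regions in the map according to the task.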
Reference score (independently computed attention score): 13.770305070674299
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present the first approach to build hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments using an uncalibrated monocular camera in real-time. We leverage geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), but we also observe that traversability information (the "places" layer of a scene graph) can be directly reconstructed by adding an extra head to existing geometric foundation models, like VGGT. Our approach is task-driven in the sense that we adjust the granularity of the objects and regions in the map depending on the task; for instance, during a manipulation task, our approach is able to resolve small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). However, in a major departure from related work, we consider the realistic case where the list of tasks is not predefined and fixed, but evolves as the robot operates. This naturally allows dealing with complex loco-manipulation tasks, where the robot can dynamically adjust its representation as the task unfolds. We dub the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach to query information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate FOUND-IT runs in real-time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of our method, we demonstrate constructing 3D scene graphs on casually captured realtor apartment tours from YouTube. Code will be made available upon publication.
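Abstract (reference translation): This work proposes a method for building hierarchical task-driven 3D scene graphs of arbitrary indoor or outdoor environments in real time using an uncalibrated monocular camera. We use geometric foundation models to estimate geometric attributes of the scene graph (e.g., object bounding boxes), and traversability information (the "places" layer of the scene graph) can also be reconstructed directly by adding an extra head to existing geometric foundation models such as VGGT. For example, during a manipulation task the approach can resolve small knobs on a stove, while during a navigation task it can focus on large objects (e.g., the entire stove). However, in a major departure from related work, we consider the realistic case where the list of tasks is not predefined or fixed but evolves as the robot operates. The robot can dynamically adjust its representation as the task unfolds. We call the resulting approach FOUND-IT. FOUND-IT also includes an agentic approach for querying information in the scene graph. In addition to achieving 79% higher accuracy on the ASHiTA SG3D task grounding benchmark, we demonstrate that FOUND-IT runs in real time on a ground robot using a Jetson Thor. Furthermore, to highlight the robustness of the method, we construct 3D scene graphs from casually captured realtor apartment tours on YouTube. Code will be made available upon publication.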

Related papers