Fugu-MT Paper Translation (Abstract): KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

Paper overview: KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins

arxiv url: http://arxiv.org/abs/2603.24684v1
Date: Wed, 25 Mar 2026 18:03:45 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:47.924084
Title: KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins
Authors: Quanyun Wu, Kyle Gao, Daniel Long, David A. Clausi, Jonathan Li, Yuhao Chen
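Abstract summary: Embodied AI training and evaluation require object-centric digital twins with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, but the predicted geometry has ambiguous scale and inconsistent coordinate conventions. This mismatch prevents reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes.
Reference score (independently computed attention level): 11.881796071022157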
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied AI training and evaluation require object-centric digital twin environments with accurate metric geometry and semantic grounding. Recent transformer-based feedforward reconstruction methods can efficiently predict global point clouds from sparse monocular videos, yet these geometries suffer from inherent scale ambiguity and inconsistent coordinate conventions. This mismatch prevents the reliable fusion of these dimensionless point cloud predictions with locally reconstructed object meshes. We propose a novel scale-aware 3D fusion framework that registers visually grounded object meshes with transformer-predicted global point clouds to construct metrically consistent digital twins. Our method introduces a Vision-Language Model (VLM)-guided geometric anchor mechanism that resolves this fundamental coordinate mismatch by recovering an accurate real-world metric scale. To fuse these networks, we propose a geometry-aware registration pipeline that explicitly enforces physical plausibility through gravity-aligned vertical estimation, Manhattan-world structural constraints, and collision-free local refinement. Experiments on real indoor kitchen environments demonstrate improved cross-network object alignment and geometric consistency for downstream tasks, including multi-primitive fitting and metric measurement. We additionally introduce an open-source indoor digital twin dataset with metrically scaled scenes and semantically grounded and registered object-centric mesh annotations.
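The abstract does not say how the VLM-guided anchor recovers metric scale. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below assumes that some upstream step has already produced pairs of (length measured in the dimensionless point cloud, assumed real-world length in metres) for a few anchor objects. The function names `recover_metric_scale` and `apply_scale` and all the numbers are hypothetical. Taking the median ratio is one simple way to resist a single bad anchor.

```python
from statistics import median


def recover_metric_scale(anchors):
    """Estimate a global scale factor (metres per cloud unit).

    anchors: list of (measured_length_in_cloud_units, assumed_real_length_m).
    The real-world lengths are assumptions (e.g. suggested by a VLM),
    not values taken from the paper.
    """
    ratios = [real / measured for measured, real in anchors if measured > 0]
    if not ratios:
        raise ValueError("need at least one anchor with a positive measured length")
    # The median keeps one badly estimated anchor from dominating the result.
    return median(ratios)


def apply_scale(points, scale):
    """Uniformly scale 3D points given as (x, y, z) tuples."""
    return [(x * scale, y * scale, z * scale) for x, y, z in points]


# Hypothetical anchors: (length in cloud units, assumed length in metres).
anchors = [(0.45, 0.90), (0.30, 0.60), (1.00, 2.10)]
scale = recover_metric_scale(anchors)

# A tiny stand-in for a dimensionless point cloud.
cloud = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.25)]
metric_cloud = apply_scale(cloud, scale)
print(scale, metric_cloud)
```

A uniform scale like this addresses only the scale ambiguity. The coordinate-convention mismatch, gravity alignment, Manhattan-world constraints, and collision-free refinement mentioned in the abstract would each need their own registration steps.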

Related papers