Fugu-MT Paper Translation (Summary): InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Paper summary: InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

arxiv url: http://arxiv.org/abs/2604.19673v1
Date: Tue, 21 Apr 2026 16:53:18 GMT
Status: Translation complete
Last updated (in system): 2026-04-22 22:41:49.886277
Title: InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Authors: Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll
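Abstract summary: InHabit is a fully automatic and scalable data generator for populating 3D scenes with interacting humans. It produces the first large-scale photorealistic 3D human-scene interaction dataset. In a perceptual user study, our data is preferred over the state of the art in 78% of cases.
Reference score (independently computed attention metric): 28.74898620366903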
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
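The render-generate-lift principle described in the abstract can be sketched as a composition of three stages. This is a minimal illustration only, not the authors' implementation: the function `render_generate_lift`, the `Sample` record, and the four stage callables (`render`, `propose_actions`, `insert_human`, `lift_to_3d`) are hypothetical names introduced here, and the real stages (a vision-language model, an image-editing model, and an SMPL-X optimization) are passed in as opaque callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    action: str            # action text proposed for the rendered view
    edited_image: object   # rendered view after a human has been inserted
    body: object           # lifted 3D body (SMPL-X parameters in the paper)


def render_generate_lift(
    scene: object,
    render: Callable[[object], object],
    propose_actions: Callable[[object], List[str]],
    insert_human: Callable[[object, str], object],
    lift_to_3d: Callable[[object, object], object],
) -> List[Sample]:
    """Hypothetical sketch: all stage callables are assumptions, not a real API."""
    # Render: produce a 2D view of the 3D scene.
    image = render(scene)
    samples = []
    # Generate: one edited image per contextually meaningful action.
    for action in propose_actions(image):
        edited = insert_human(image, action)
        # Lift: recover a 3D body aligned with the scene geometry.
        body = lift_to_3d(edited, scene)
        samples.append(Sample(action, edited, body))
    return samples
```

Passing the stages as callables keeps the sketch independent of any particular model; each one can be swapped or stubbed without changing the control flow.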


Related papers