Fugu-MT Paper Translation (Summary): SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Paper summary: SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

arxiv url: http://arxiv.org/abs/2602.20476v1
Date: Tue, 24 Feb 2026 02:09:12 GMT
Status: Translation complete
Last updated in system: 2026-02-25 17:34:53.577262
Title: SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
Authors: Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral
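Abstract summary: SceMoS is a scene-aware motion synthesis framework. It uses lightweight 2D cues to disentangle global planning from local execution. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark.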
Reference score (independently computed attention score): 89.05195827071582
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (e.g., "walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework showing that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues. It relies on (1) a text-conditioned autoregressive global motion planner whose scene representation is a bird's-eye-view (BEV) image rendered from an elevated corner of the scene and encoded with DINOv2 features, and (2) a geometry-grounded motion tokenizer, trained via a conditional VQ-VAE, that uses a 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization reaches an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark while reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.
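Illustrative sketch (not from the paper): the abstract mentions two ingredients that are easy to picture in code, a local 2D heightmap around the body and a discrete vocabulary obtained by vector quantization. The snippet below is a minimal, assumption-laden sketch of both ideas in plain Python. The function names `crop_local_heightmap` and `quantize`, the zero padding outside the scene, and the toy codebook are all hypothetical and are not taken from SceMoS.

```python
from typing import List, Sequence


def crop_local_heightmap(heightmap: Sequence[Sequence[float]],
                         row: int, col: int, half: int) -> List[List[float]]:
    """Return a (2*half+1) x (2*half+1) patch centred on (row, col).

    Cells that fall outside the scene grid are filled with 0.0
    (an assumption made here for illustration only).
    """
    n_rows, n_cols = len(heightmap), len(heightmap[0])
    patch = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < n_rows and 0 <= c < n_cols
            line.append(heightmap[r][c] if inside else 0.0)
        patch.append(line)
    return patch


def quantize(latent: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to `latent`
    under squared Euclidean distance (the lookup step of a VQ-VAE)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(latent, codebook[i]))


# Toy scene: a 0.4 m high object occupies the lower-right 2x2 cells.
scene = [[0.0, 0.0, 0.0],
         [0.0, 0.4, 0.4],
         [0.0, 0.4, 0.4]]
patch = crop_local_heightmap(scene, row=2, col=2, half=1)

# Toy codebook with three 2D entries; a real tokenizer would learn these.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
token = quantize([0.9, 0.8], codebook)
```

In a conditional VQ-VAE the encoder would see both the motion and a patch like `patch`, and the resulting latent would be snapped to a codebook index like `token`; how SceMoS actually builds and conditions on its heightmaps is described in the paper, not here.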
