Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
- URL: http://arxiv.org/abs/2511.07819v1
- Date: Wed, 12 Nov 2025 01:21:46 GMT
- Title: Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
- Authors: Jingyu Gong, Kunkun Tong, Zhuoran Chen, Chuanhan Yuan, Mingang Chen, Zhizhong Zhang, Xin Tan, Yuan Xie
- Abstract summary: We propose SSOMotion, a human motion synthesis framework that uses a unified Scene Semantic Occupancy (SSO) for scene representation. We use scene hints and the movement direction derived from instructions for motion control via frame-wise scene queries. Experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion synthesis in 3D scenes relies heavily on scene comprehension, yet current methods focus mainly on scene structure and ignore semantic understanding. In this paper, we propose a human motion synthesis framework, termed SSOMotion, that uses a unified Scene Semantic Occupancy (SSO) for scene representation. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and a shared linear dimensionality reduction. This strategy captures fine-grained scene semantic structure while significantly reducing redundant computation. We further use these scene hints, together with the movement direction derived from instructions, to control motion via frame-wise scene queries. Extensive experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate cutting-edge performance and validate the method's effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
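To make the semantic-unification step concrete, here is a minimal sketch, not the authors' released code, of how per-voxel semantic labels could be mapped into a shared feature space via CLIP text encoding and a shared linear projection, then factorized into three planes. The class list, projection width, grid resolution, and the mean-pooling tri-plane variant are illustrative assumptions; the paper's bi-directional decomposition is not detailed in the abstract.

```python
# Minimal sketch (not the authors' code): unify per-voxel semantics via CLIP
# text features and a shared linear projection, then tri-plane-factorize the
# volume. Class names, dims, and resolutions below are illustrative assumptions.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical categories present in the scene.
class_names = ["floor", "wall", "chair", "table", "sofa", "bed"]
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(tokens).float()            # (C, 512)

# Shared linear dimensionality reduction to a compact semantic code.
proj = torch.nn.Linear(text_feats.shape[-1], 32).to(device)   # 32-d is assumed
sem_codes = proj(text_feats)                                  # (C, 32)

# Scatter the compact codes into a Scene Semantic Occupancy volume.
D = H = W = 64                                                # assumed resolution
labels = torch.randint(0, len(class_names), (D, H, W), device=device)
occupied = torch.rand(D, H, W, device=device) > 0.7           # dummy occupancy
sso = sem_codes[labels] * occupied.unsqueeze(-1)              # (D, H, W, 32)

# A plain tri-plane factorization by axis-wise pooling; the paper's
# "bi-directional" variant is not specified in the abstract.
plane_xy, plane_xz, plane_yz = sso.mean(0), sso.mean(1), sso.mean(2)
print(plane_xy.shape, plane_xz.shape, plane_yz.shape)
```

Sharing one projection across all categories keeps every label in the same CLIP-derived space, which is what lets the occupancy grid carry open-vocabulary semantics at a low memory cost.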
Related papers
- SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens [89.05195827071582]
SceMoS is a scene-aware motion synthesis framework.
It disentangles global planning from local execution using lightweight 2D cues.
SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark.
arXiv Detail & Related papers (2026-02-24T02:09:12Z)
- REACT3D: Recovering Articulations for Interactive Physical 3D Scenes [96.27769519526426]
REACT3D is a framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry.
We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes.
arXiv Detail & Related papers (2025-10-13T12:37:59Z)
- ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting [54.92763171355442]
ObjectGS is an object-aware framework that unifies 3D scene reconstruction with semantic understanding.
We show through experiments that ObjectGS outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks.
arXiv Detail & Related papers (2025-07-21T10:06:23Z)
- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion [86.34232220368855]
Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner.
In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy.
arXiv Detail & Related papers (2025-07-08T17:59:50Z)
- SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields [33.113865514268085]
Holistic 3D scene understanding is crucial for applications like augmented reality and robotic interaction.
Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes.
We propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method.
arXiv Detail & Related papers (2025-06-11T09:56:39Z)
- ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with Preference Alignment [8.954070942391603]
ReSpace is a generative framework for text-driven 3D indoor scene synthesis and editing.
We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment.
For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition.
arXiv Detail & Related papers (2025-06-03T05:22:04Z)
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior [27.773451301040424]
InstructScene is a novel generative framework that integrates a semantic graph prior and a layout decoder.
We show that the proposed method surpasses existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2024-02-07T10:09:00Z)
- Generating Continual Human Motion in Diverse 3D Scenes [51.90506920301473]
We introduce a method to synthesize animator-guided human motion across 3D scenes.
We decompose the continual motion synthesis problem into walking along paths and transitioning in and out of the actions specified by the keypoints.
Our model can generate long sequences of diverse actions, such as grabbing, sitting, and leaning, chained together.
arXiv Detail & Related papers (2023-04-04T18:24:22Z)
- Semantic Scene Completion via Integrating Instances and Scene in-the-Loop [73.11401855935726]
Semantic Scene Completion aims at reconstructing a complete 3D scene with precise voxel-wise semantics from a single-view depth or RGB-D image.
We present the Scene-Instance-Scene Network (SISNet), which takes advantage of both instance- and scene-level semantic information.
Our method is capable of inferring fine-grained shape details as well as nearby objects whose semantic categories are easily mixed up.
arXiv Detail & Related papers (2021-04-08T09:50:30Z)
- 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation (a minimal depth back-projection sketch follows this list).
We devise a new geometry-based strategy to embed depth information with a low-resolution voxel representation.
Our geometric embedding outperforms the depth feature learning used in conventional SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
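As a companion to the SSC entry above, here is a minimal sketch of the single-view input side: back-projecting a depth map into a coarse occupancy voxel grid. The intrinsics, depth values, grid size, and voxel size are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: back-project a single-view metric depth map into a coarse
# boolean occupancy voxel grid, the kind of low-resolution volumetric input
# SSC methods consume. All numeric parameters here are illustrative.
import numpy as np

def depth_to_occupancy(depth, fx, fy, cx, cy, grid=(60, 36, 60), voxel=0.08):
    """Back-project a (H, W) depth map (meters) into a boolean voxel grid."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                               # drop missing depth readings
    x = (u.reshape(-1) - cx) * z / fx           # camera-space X
    y = (v.reshape(-1) - cy) * z / fy           # camera-space Y
    pts = np.stack([x, y, z], axis=1)[valid]
    idx = np.floor(pts / voxel).astype(int)     # continuous coords -> voxel ids
    idx -= idx.min(axis=0)                      # shift indices to start at zero
    occ = np.zeros(grid, dtype=bool)
    inside = np.all(idx < np.array(grid), axis=1)
    ix, iy, iz = idx[inside].T
    occ[ix, iy, iz] = True
    return occ

# Toy usage with a synthetic depth map (a flat wall 2 m from the camera).
depth = np.full((480, 640), 2.0, dtype=np.float32)
occ = depth_to_occupancy(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(occ.sum(), "occupied voxels")
```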