Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
- URL: http://arxiv.org/abs/2511.07819v1
- Date: Wed, 12 Nov 2025 01:21:46 GMT
- Title: Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
- Authors: Jingyu Gong, Kunkun Tong, Zhuoran Chen, Chuanhan Yuan, Mingang Chen, Zhizhong Zhang, Xin Tan, Yuan Xie
- Abstract summary: We propose SSOMotion, a human motion synthesis framework that uses a unified Scene Semantic Occupancy (SSO) for scene representation. We use scene hints and the movement direction derived from instructions for motion control via frame-wise scene queries. Experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human motion synthesis in 3D scenes relies heavily on scene comprehension, yet current methods focus mainly on scene structure and ignore semantic understanding. In this paper, we propose a human motion synthesis framework, termed SSOMotion, that uses a unified Scene Semantic Occupancy (SSO) for scene representation. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and a shared linear dimensionality reduction. This strategy captures fine-grained scene semantic structure while significantly reducing redundant computation. We further use these scene hints, together with the movement direction derived from instructions, to control motion via frame-wise scene queries. Extensive experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate cutting-edge performance and validate the method's effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
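To make the semantic-unification step concrete, here is a minimal sketch, not the authors' released code, of how per-voxel semantic labels could be mapped into a shared feature space via CLIP text encoding and a shared linear projection, then factorized into three planes. The class list, projection width, grid resolution, and the mean-pooling tri-plane variant are illustrative assumptions; the paper's bi-directional decomposition is not detailed in the abstract.

```python
# Minimal sketch (not the authors' code): unify per-voxel semantics via CLIP
# text features and a shared linear projection, then tri-plane-factorize the
# volume. Class names, dims, and resolutions below are illustrative assumptions.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical categories present in the scene.
class_names = ["floor", "wall", "chair", "table", "sofa", "bed"]
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(tokens).float()            # (C, 512)

# Shared linear dimensionality reduction to a compact semantic code.
proj = torch.nn.Linear(text_feats.shape[-1], 32).to(device)   # 32-d is assumed
sem_codes = proj(text_feats)                                  # (C, 32)

# Scatter the compact codes into a Scene Semantic Occupancy volume.
D = H = W = 64                                                # assumed resolution
labels = torch.randint(0, len(class_names), (D, H, W), device=device)
occupied = torch.rand(D, H, W, device=device) > 0.7           # dummy occupancy
sso = sem_codes[labels] * occupied.unsqueeze(-1)              # (D, H, W, 32)

# A plain tri-plane factorization by axis-wise pooling; the paper's
# "bi-directional" variant is not specified in the abstract.
plane_xy, plane_xz, plane_yz = sso.mean(0), sso.mean(1), sso.mean(2)
print(plane_xy.shape, plane_xz.shape, plane_yz.shape)
```

Sharing one projection across all categories keeps every label in the same CLIP-derived space, which is what lets the occupancy grid carry open-vocabulary semantics at a low memory cost.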
Related papers
- SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens [89.05195827071582]
SceMoS is a scene-aware motion synthesis framework.
It disentangles global planning from local execution using lightweight 2D cues.
SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark.
arXiv Detail & Related papers (2026-02-24T02:09:12Z)
- REACT3D: Recovering Articulations for Interactive Physical 3D Scenes [96.27769519526426]
REACT3D is a framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry.
We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes.
arXiv Detail & Related papers (2025-10-13T12:37:59Z)
- ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting [54.92763171355442]
ObjectGS is an object-aware framework that unifies 3D scene reconstruction with semantic understanding.
We show through experiments that ObjectGS outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks.
arXiv Detail & Related papers (2025-07-21T10:06:23Z)
- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion [86.34232220368855]
Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner.
In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy.
arXiv Detail & Related papers (2025-07-08T17:59:50Z)
- SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields [33.113865514268085]
Holistic 3D scene understanding is crucial for applications like augmented reality and robotic interaction.
Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes.
We propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method.
arXiv Detail & Related papers (2025-06-11T09:56:39Z)
- ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with Preference Alignment [8.954070942391603]
ReSpace is a generative framework for text-driven 3D indoor scene synthesis and editing.
We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment.
For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition.
arXiv Detail & Related papers (2025-06-03T05:22:04Z)
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior [27.773451301040424]
InstructScene is a novel generative framework that integrates a semantic graph prior and a layout decoder.
We show that the proposed method surpasses existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2024-02-07T10:09:00Z)
- Generating Continual Human Motion in Diverse 3D Scenes [51.90506920301473]
We introduce a method to synthesize animator-guided human motion across 3D scenes.
We decompose the continual motion synthesis problem into walking along paths and transitioning in and out of the actions specified by the keypoints.
Our model can generate long sequences of diverse actions, such as grabbing, sitting, and leaning, chained together.
arXiv Detail & Related papers (2023-04-04T18:24:22Z)
- Semantic Scene Completion via Integrating Instances and Scene in-the-Loop [73.11401855935726]
Semantic Scene Completion aims at reconstructing a complete 3D scene with precise voxel-wise semantics from a single-view depth or RGB-D image.
We present the Scene-Instance-Scene Network (SISNet), which takes advantage of both instance- and scene-level semantic information.
Our method is capable of inferring fine-grained shape details as well as nearby objects whose semantic categories are easily mixed up.
arXiv Detail & Related papers (2021-04-08T09:50:30Z)
- 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation (a minimal depth back-projection sketch follows this list).
We devise a new geometry-based strategy to embed depth information with a low-resolution voxel representation.
Our geometric embedding outperforms the depth feature learning used in conventional SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
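As a companion to the SSC entry above, here is a minimal sketch of the single-view input side: back-projecting a depth map into a coarse occupancy voxel grid. The intrinsics, depth values, grid size, and voxel size are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: back-project a single-view metric depth map into a coarse
# boolean occupancy voxel grid, the kind of low-resolution volumetric input
# SSC methods consume. All numeric parameters here are illustrative.
import numpy as np

def depth_to_occupancy(depth, fx, fy, cx, cy, grid=(60, 36, 60), voxel=0.08):
    """Back-project a (H, W) depth map (meters) into a boolean voxel grid."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                               # drop missing depth readings
    x = (u.reshape(-1) - cx) * z / fx           # camera-space X
    y = (v.reshape(-1) - cy) * z / fy           # camera-space Y
    pts = np.stack([x, y, z], axis=1)[valid]
    idx = np.floor(pts / voxel).astype(int)     # continuous coords -> voxel ids
    idx -= idx.min(axis=0)                      # shift indices to start at zero
    occ = np.zeros(grid, dtype=bool)
    inside = np.all(idx < np.array(grid), axis=1)
    ix, iy, iz = idx[inside].T
    occ[ix, iy, iz] = True
    return occ

# Toy usage with a synthetic depth map (a flat wall 2 m from the camera).
depth = np.full((480, 640), 2.0, dtype=np.float32)
occ = depth_to_occupancy(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(occ.sum(), "occupied voxels")
```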