Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots
- URL: http://arxiv.org/abs/2509.24966v1
- Date: Mon, 29 Sep 2025 16:00:40 GMT
- Title: Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots
- Authors: Ermanno Bartoli, Dennis Rotondi, Buwei He, Patric Jensfelt, Kai O. Arras, Iolanda Leite
- Abstract summary: We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, activities and relationships in the environment, both local and remote. Our representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.
- Score: 5.8503433899583905
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding how people interact with their surroundings and each other is essential for enabling robots to act in socially compliant and context-aware ways. While 3D Scene Graphs have emerged as a powerful semantic representation for scene understanding, existing approaches largely ignore humans in the scene, also due to the lack of annotated human-environment relationships. Moreover, existing methods typically capture only open-vocabulary relations from single image frames, which limits their ability to model long-range interactions beyond the observed content. We introduce Social 3D Scene Graphs, an augmented 3D Scene Graph representation that captures humans, their attributes, activities and relationships in the environment, both local and remote, using an open-vocabulary framework. Furthermore, we introduce a new benchmark consisting of synthetic environments with comprehensive human-scene relationship annotations and diverse types of queries for evaluating social scene understanding in 3D. The experiments demonstrate that our representation improves human activity prediction and reasoning about human-environment relations, paving the way toward socially intelligent robots.
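The abstract describes a graph representation whose nodes include humans (with attributes and activities) alongside objects and rooms, and whose edges carry open-vocabulary relations that may be "remote", i.e. span beyond the locally observed content. A minimal sketch of such a structure is given below; all class names, fields, and the example relations are illustrative assumptions, not the authors' actual API.

```python
# Minimal sketch of a Social 3D Scene Graph: humans are first-class nodes
# with attributes and activities, and edges carry open-vocabulary relations
# that can be flagged as "remote" (spanning rooms or unobserved content).
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str                       # "room", "object", or "human"
    position: tuple                 # 3D centroid (x, y, z)
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    relation: str                   # open-vocabulary label, e.g. "sitting on"
    remote: bool = False            # True if the relation spans rooms


class SocialSceneGraph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, source: str, target: str,
               relation: str, remote: bool = False) -> None:
        self.edges.append(Edge(source, target, relation, remote))

    def relations_of(self, node_id: str) -> list[str]:
        """Human-readable relations involving a given node."""
        return [f"{e.source} {e.relation} {e.target}"
                for e in self.edges if node_id in (e.source, e.target)]


g = SocialSceneGraph()
g.add_node(Node("kitchen", "room", (0.0, 0.0, 0.0)))
g.add_node(Node("sofa", "object", (4.0, 1.0, 0.0)))
g.add_node(Node("alice", "human", (4.1, 1.0, 0.0),
                attributes={"activity": "reading", "age_group": "adult"}))
g.relate("alice", "sofa", "sitting on")
g.relate("alice", "kitchen", "heading towards", remote=True)
print(g.relations_of("alice"))
# prints: ['alice sitting on sofa', 'alice heading towards kitchen']
```

Keeping relations as free-form strings rather than a fixed label set mirrors the paper's open-vocabulary framing, and the `remote` flag is one simple way to mark relations that extend beyond a single observed frame.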
Related papers
- Simple 3D Pose Features Support Human and Machine Social Scene Understanding [1.411894456054802]
We show that humans rely on 3D visuospatial pose information to make social interaction judgments. We extract 3D joint positions of people in short video clips depicting everyday human actions. Strikingly, 3D joint positions outperformed most current AI vision models.
arXiv Detail & Related papers (2025-11-06T02:19:26Z)
- HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions [9.204149287692598]
HOIverse is a synthetic dataset at the intersection of scene graph and human-object interaction. We compute parametric relations between various pairs of objects and human-object pairs. We benchmark our dataset on state-of-the-art scene graph generation models to predict parametric relations and human-object interactions.
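A "parametric relation" in the sense of this summary derives a relation from continuous geometry rather than a bare label. The sketch below maps a human-object distance to a discrete relation; the thresholds and label names are illustrative assumptions, not the dataset's actual definitions.

```python
# Hedged sketch of a parametric human-object relation: a continuous distance
# between a human joint and an object centroid is thresholded into a label.
import math


def euclidean(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contact_relation(joint_xyz, object_xyz, near_thresh=0.6):
    """Map a continuous human-object distance (metres) to a relation label."""
    d = euclidean(joint_xyz, object_xyz)
    if d < 0.1:
        return "touching", d
    if d < near_thresh:
        return "next to", d
    return "far from", d


label, dist = contact_relation((0.0, 0.9, 0.3), (0.0, 0.5, 0.3))
print(label, round(dist, 2))   # prints: next to 0.4
```

Because the underlying distance is kept alongside the label, a downstream scene graph model can be evaluated on both the discrete relation and the continuous parameter it was derived from.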
arXiv Detail & Related papers (2025-06-24T14:00:31Z)
- Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis [80.50342609047091]
Scene-aware text-to-human synthesis generates diverse indoor motion samples from the same textual description. We propose a cascaded generation strategy that factorizes text-driven scene-specific human motion generation into three stages. We jointly improve realistic human motion synthesis and robust human motion analysis in 3D scenes.
arXiv Detail & Related papers (2025-03-01T06:56:58Z)
- ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation [17.438484695828276]
We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis. Our key insight is to distill human-scene interactions from state-of-the-art video generation models. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects.
arXiv Detail & Related papers (2024-12-24T18:55:38Z)
- GenZI: Zero-Shot 3D Human-Scene Interaction Generation [39.9039943099911]
We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions.
Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions.
In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data.
arXiv Detail & Related papers (2023-11-29T15:40:11Z)
- HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes [54.61610144668777]
We present a novel scene-and-language conditioned generative model that can produce 3D human motions in 3D scenes.
Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes.
arXiv Detail & Related papers (2022-10-18T10:14:11Z)
- Contact-aware Human Motion Forecasting [87.04827994793823]
We tackle the task of scene-aware 3D human motion forecasting, which consists of predicting future human poses given a 3D scene and a past human motion.
Our approach outperforms the state-of-the-art human motion forecasting and human synthesis methods on both synthetic and real datasets.
arXiv Detail & Related papers (2022-10-08T07:53:19Z)
- Triangular Character Animation Sampling with Motion, Emotion, and Relation [78.80083186208712]
We present a novel framework to sample and synthesize animations by associating the characters' body motions, facial expressions, and social relations.
Our method can provide animators with an automatic way to generate 3D character animations, help synthesize interactions between Non-Player Characters (NPCs), and enhance machine emotion intelligence in virtual reality (VR).
arXiv Detail & Related papers (2022-03-09T18:19:03Z)
- HSPACE: Synthetic Parametric Humans Animated in Complex Environments [67.8628917474705]
We build a large-scale photo-realistic dataset, Human-SPACE, of animated humans placed in complex indoor and outdoor environments.
We combine a hundred diverse individuals of varying ages, gender, proportions, and ethnicity, with hundreds of motions and scenes, in order to generate an initial dataset of over 1 million frames.
Assets are generated automatically, at scale, and are compatible with existing real time rendering and game engines.
arXiv Detail & Related papers (2021-12-23T22:27:55Z)
- PLACE: Proximity Learning of Articulation and Contact in 3D Environments [70.50782687884839]
We propose a novel interaction generation method, named PLACE, which explicitly models the proximity between the human body and the 3D scene around it.
Our perceptual study shows that PLACE significantly improves the state-of-the-art method, approaching the realism of real human-scene interaction.
arXiv Detail & Related papers (2020-08-12T21:00:10Z)
- 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [27.747241700017728]
We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs.
3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.
arXiv Detail & Related papers (2020-02-15T00:46:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.