SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
- URL: http://arxiv.org/abs/2512.16461v1
- Date: Thu, 18 Dec 2025 12:27:06 GMT
- Title: SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
- Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
- Abstract summary: We propose a framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate segmentation proposals. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference.
- Score: 11.93789125154006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as a 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing global reference alignment and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings and highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
Related papers
- 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer [40.29321632546414]
4DVGGT is the first Transformer-based feed-forward unified framework for 4D language grounding. It integrates geometric perception and language alignment within a single architecture. It can be jointly trained across multiple dynamic scenes and directly applied during inference.
arXiv Detail & Related papers (2025-12-04T18:15:27Z) - Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation [61.60600246983274]
Existing 3D and 4D approaches typically embed scene geometry into an autoregressive model for semantic understanding and a diffusion model for content generation. We propose Uni4D-LLM, the first unified VLM framework with spatio-temporal awareness for 4D scene understanding and generation.
arXiv Detail & Related papers (2025-09-28T12:06:54Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding [55.81291976637705]
We propose a general LMM framework with a spatio-temporal prompt for visual representation in 4D scene understanding. The prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
arXiv Detail & Related papers (2025-05-18T06:18:57Z) - Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos [70.07088203106443]
Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked Autoencoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
arXiv Detail & Related papers (2025-04-07T08:47:36Z) - DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation [50.01520547454224]
Current generative models struggle to synthesize 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS). We propose DiST-4D, which disentangles the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
arXiv Detail & Related papers (2025-03-19T13:49:48Z) - ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z) - NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding [20.79861588128133]
We introduce a generic online 4D perception paradigm called NSM4D.
NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones.
We demonstrate significant improvements on various online perception benchmarks in indoor and outdoor settings.
arXiv Detail & Related papers (2023-10-12T13:42:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.