Understanding Dynamic Scenes in Ego Centric 4D Point Clouds
- URL: http://arxiv.org/abs/2508.07251v2
- Date: Tue, 12 Aug 2025 15:01:27 GMT
- Title: Understanding Dynamic Scenes in Ego Centric 4D Point Clouds
- Authors: Junsheng Huang, Shengyu Hao, Bocheng Hu, Gaoang Wang
- Abstract summary: EgoDynamic4D is a novel QA benchmark on highly dynamic scenes. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. Our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.
- Score: 7.004204907286336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding dynamic 4D scenes from an egocentric perspective (modeling changes in 3D spatial structure over time) is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.
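To make the token-compression idea in the abstract concrete, here is a minimal PyTorch sketch of the pipeline (instance-aware point features, time and camera encodings, and down-sampling to a bounded token budget). All names, dimensions, and the random-subsampling stand-in for spatially adaptive down-sampling are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Scene4DTokenizer(nn.Module):
    """Compress one frame of a 4D scene (points + instance ids + time +
    camera pose) into a short token sequence an LLM can attend over."""

    def __init__(self, feat_dim: int = 256, llm_dim: int = 4096, max_tokens: int = 512):
        super().__init__()
        # Instance-aware point encoder: xyz plus a scalar instance id.
        self.point_mlp = nn.Sequential(
            nn.Linear(4, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        self.time_embed = nn.Linear(1, feat_dim)    # time encoding
        self.cam_embed = nn.Linear(12, feat_dim)    # flattened 3x4 camera pose
        self.to_llm = nn.Linear(feat_dim, llm_dim)  # project into the LLM embedding space
        self.max_tokens = max_tokens

    def forward(self, xyz, instance_id, t, cam_pose):
        # xyz: (N, 3), instance_id: (N, 1), t: scalar tensor, cam_pose: (3, 4)
        feats = self.point_mlp(torch.cat([xyz, instance_id], dim=-1))
        feats = feats + self.time_embed(t.view(1, 1)) + self.cam_embed(cam_pose.reshape(1, 12))
        # Stand-in for spatially adaptive down-sampling: random subsampling
        # keeps the sequence length bounded for the LLM.
        if feats.shape[0] > self.max_tokens:
            keep = torch.randperm(feats.shape[0])[: self.max_tokens]
            feats = feats[keep]
        return self.to_llm(feats)  # (<= max_tokens, llm_dim)

# Example: 2048 points from one frame become at most 512 LLM tokens.
tokenizer = Scene4DTokenizer()
tokens = tokenizer(
    torch.randn(2048, 3), torch.zeros(2048, 1),
    torch.tensor(0.5), torch.eye(3, 4),
)
```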
Related papers
- Inferring Compositional 4D Scenes without Ever Seeing One [58.81854043690171]
We propose a method that consistently predicts the structure and spatio-temporal configuration of 4D/3D objects. We achieve this through carefully designed training of several spatial and temporal attention modules on 2D video input. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and composed scenes.
arXiv Detail & Related papers (2025-12-04T21:51:47Z)
- DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling [67.95038177144554]
We introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic captions. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos.
arXiv Detail & Related papers (2025-12-02T18:24:27Z)
- LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding [55.81291976637705]
We propose a general LMM framework with a spatiotemporal prompt for visual representation in 4D scene understanding. The prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
arXiv Detail & Related papers (2025-05-18T06:18:57Z)
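The "dynamic-aware 4D coordinate embedding" described above can be sketched with a standard sinusoidal encoding over (x, y, z, t); the function below is a generic stand-in for illustration, not LLaVA-4D's actual embedding:

```python
import torch

def embed_xyzt(xyzt: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Sinusoidal embedding of (x, y, z, t) coordinates.

    xyzt: (N, 4) tensor; returns (N, 4 * 2 * num_freqs) features that an
    LMM could consume as a spatiotemporal prompt.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyzt.dtype)  # 1, 2, 4, ...
    angles = xyzt[..., None] * freqs                          # (N, 4, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                          # (N, 8 * num_freqs)

# embed_xyzt(torch.rand(100, 4)).shape -> torch.Size([100, 64])
```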
- 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [115.67081491747943]
Dynamic 3D scene representation and novel view synthesis are crucial for enabling AR/VR and metaverse applications. We reformulate the reconstruction of a time-varying 3D scene as approximating its underlying 4D volume. We derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck.
arXiv Detail & Related papers (2024-12-30T05:30:26Z)
- Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering [23.04702935216809]
We introduce DynSuperCLEVR, the first video question answering dataset that focuses on language understanding of the dynamic properties of 3D objects. We generate three types of questions, including factual queries, future predictions, and counterfactual reasoning. Our method first estimates the 4D world states with a 3D generative model powered by physical priors, and then uses neural symbolic reasoning to answer the questions based on the 4D world states.
arXiv Detail & Related papers (2024-06-02T05:51:15Z)
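The two-stage recipe above (estimate 4D world states with physics priors, then reason symbolically over them) can be illustrated with a toy factual query. The data structures and the hypothetical `average_velocity` helper below are invented for illustration and are not from the paper:

```python
from dataclasses import dataclass

@dataclass
class ObjectState:
    position: tuple  # (x, y, z) of an object at one frame

def average_velocity(states, obj, t0, t1, fps=30.0):
    """Answer a factual velocity query from estimated per-frame world states."""
    p0, p1 = states[t0][obj].position, states[t1][obj].position
    dt = (t1 - t0) / fps
    return tuple((b - a) / dt for a, b in zip(p0, p1))

# states = {0: {"cube": ObjectState((0.0, 0.0, 0.0))},
#           5: {"cube": ObjectState((1.0, 0.0, 0.0))}}
# average_velocity(states, "cube", 0, 5)  -> (6.0, 0.0, 0.0) units per second
```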
- Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting [8.078460597825142]
Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging due to scene complexity and temporal dynamics.
We propose to approximate the underlying spatio-temporal rendering volume of a dynamic scene by optimizing a collection of 4D primitives, with explicit geometry and appearance modeling.
Our model is conceptually simple, consisting of 4D Gaussians parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, with view-dependent and time-evolved appearance represented by coefficients of 4D spherindrical harmonics.
arXiv Detail & Related papers (2023-10-16T17:57:43Z)
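Taking the parameterization described above literally, a single primitive might be laid out roughly as follows; this dataclass is a guess at the structure for illustration, not the paper's code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian4D:
    mean: np.ndarray       # (4,) center in (x, y, z, t)
    scale: np.ndarray      # (4,) per-axis extents of the anisotropic ellipse
    rotation: np.ndarray   # (4, 4) rotation that can mix spatial and temporal axes
    sh_coeffs: np.ndarray  # (K, 3) RGB coefficients of 4D spherindrical harmonics
    opacity: float

    def covariance(self) -> np.ndarray:
        # Sigma = R diag(s^2) R^T, the usual Gaussian-splatting construction
        # lifted from 3D to 4D.
        return self.rotation @ np.diag(self.scale ** 2) @ self.rotation.T
```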
- NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding [20.79861588128133]
We introduce a generic online 4D perception paradigm called NSM4D.
NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones.
We demonstrate significant improvements on various online perception benchmarks in indoor and outdoor settings.
arXiv Detail & Related papers (2023-10-12T13:42:49Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)