Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
- URL: http://arxiv.org/abs/2406.00622v2
- Date: Wed, 23 Apr 2025 04:53:40 GMT
- Title: Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering
- Authors: Xingrui Wang, Wufei Ma, Angtian Wang, Shuo Chen, Adam Kortylewski, Alan Yuille
- Abstract summary: We introduce DynSuperCLEVR, the first video question answering dataset that focuses on language understanding of the dynamic properties of 3D objects. We generate three types of questions, including factual queries, future predictions, and counterfactual reasoning. Our method first estimates the 4D world states with a 3D generative model powered by physical priors, and then uses neural symbolic reasoning to answer the questions based on the 4D world states.
- Score: 23.04702935216809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For vision-language models (VLMs), understanding the dynamic properties of objects and their interactions in 3D scenes from videos is crucial for effective reasoning about high-level temporal and action semantics. Although humans are adept at understanding these properties by constructing 3D and temporal (4D) representations of the world, current video understanding models struggle to extract these dynamic semantics, arguably because these models use cross-frame reasoning without underlying knowledge of the 3D/4D scenes. In this work, we introduce DynSuperCLEVR, the first video question answering dataset that focuses on language understanding of the dynamic properties of 3D objects. We concentrate on three physical concepts -- velocity, acceleration, and collisions within 4D scenes. We further generate three types of questions, including factual queries, future predictions, and counterfactual reasoning, that involve different aspects of reasoning about these 4D dynamic properties. To further demonstrate the importance of explicit scene representations in answering these 4D dynamics questions, we propose NS-4DPhysics, a Neural-Symbolic VideoQA model integrating a physics prior for 4D dynamic properties with explicit scene representation of videos. Instead of answering the questions directly from the video and text input, our method first estimates the 4D world states with a 3D generative model powered by physical priors, and then uses neural symbolic reasoning to answer the questions based on the 4D world states. Our evaluation on all three types of questions in DynSuperCLEVR shows that previous video question answering models and large multimodal models struggle with questions about 4D dynamics, while our NS-4DPhysics significantly outperforms previous state-of-the-art models. Our code and data are released at https://xingruiwang.github.io/projects/DynSuperCLEVR/.
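To make the two-stage design concrete, here is a minimal, self-contained sketch (not the authors' released code) of the NS-4DPhysics idea: estimate 4D object states from per-frame 3D positions, regularize them with a simple physics prior, and then answer a factual question symbolically over the resulting states. All names (`ObjectTrack`, `smooth_with_constant_accel_prior`, `answer_fastest`) and the jerk-penalty prior are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of a "physics prior + symbolic reasoning" VideoQA pipeline.
# Hypothetical names throughout; the paper's state estimator is a 3D
# generative model, which the toy tracks below merely stand in for.

from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectTrack:
    name: str
    positions: np.ndarray  # (T, 3) estimated 3D object centers per frame

def finite_difference(x: np.ndarray, dt: float) -> np.ndarray:
    """Central differences along the time axis (velocity from positions)."""
    return np.gradient(x, dt, axis=0)

def smooth_with_constant_accel_prior(pos: np.ndarray, dt: float,
                                     weight: float = 4.0) -> np.ndarray:
    """Toy physics prior: penalize the third temporal difference (jerk),
    i.e., prefer near-constant acceleration, via a ridge-style solve."""
    T = len(pos)
    # Third-difference operator D, so (D @ pos) approximates jerk * dt^3.
    D = np.zeros((T - 3, T))
    for i in range(T - 3):
        D[i, i:i + 4] = [-1.0, 3.0, -3.0, 1.0]
    A = np.eye(T) + weight * D.T @ D
    return np.linalg.solve(A, pos)  # x, y, z columns solved jointly

def answer_fastest(tracks: list[ObjectTrack], frame: int, dt: float) -> str:
    """Symbolic factual query over 4D states: which object is fastest?"""
    speeds = {t.name: np.linalg.norm(finite_difference(
        smooth_with_constant_accel_prior(t.positions, dt), dt)[frame])
        for t in tracks}
    return max(speeds, key=speeds.get)

# Usage on synthetic tracks: a constant-velocity car vs. an accelerating bike.
dt, T = 0.1, 40
ts = np.arange(T)[:, None] * dt
zeros = np.zeros_like(ts)
car = ObjectTrack("car", np.hstack([2.0 * ts, zeros, zeros]))     # v = 2 m/s
bike = ObjectTrack("bike", np.hstack([0.5 * ts**2, zeros, zeros]))  # v = t
print(answer_fastest([car, bike], frame=30, dt=dt))  # -> "bike" (3 m/s > 2 m/s)
```

Future-prediction and counterfactual questions would extend this by rolling the smoothed states forward under the same dynamics (or under a modified initial state) before querying them symbolically.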
Related papers
- Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields [11.212256115568772]
We introduce an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate physically realistic 4D videos. We also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos.
arXiv Detail & Related papers (2026-01-29T11:37:41Z) - Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis [53.48281548500864]
Motion 3-to-4 is a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video. Our model learns a compact motion latent representation and predicts per-frame trajectories to recover complete, temporally coherent geometry.
arXiv Detail & Related papers (2026-01-20T18:59:48Z) - Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models [79.18306680174011]
DSR Suite bridges the gap across dataset, benchmark, and model. We propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. The pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories.
arXiv Detail & Related papers (2025-12-23T17:56:36Z) - Advances in 4D Representation: Geometry, Motion, and Interaction [21.99533577912307]
We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics. We build our coverage of the domain from a unique and distinctive perspective of 4D representations.
arXiv Detail & Related papers (2025-10-22T05:22:20Z) - Understanding Dynamic Scenes in Ego Centric 4D Point Clouds [7.004204907286336]
EgoDynamic4D is a novel QA benchmark on highly dynamic scenes. We design 12 dynamic QA tasks covering agent motion, human-object interaction, relation prediction, trajectory understanding, and temporal-causal reasoning, with fine-grained, explicit metrics. Our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for ego-centric dynamic scene understanding.
arXiv Detail & Related papers (2025-08-10T09:08:04Z) - Easi3R: Estimating Disentangled Motion from DUSt3R Without Training [48.87063562819018]
We introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction.
Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning.
Our experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-03-31T17:59:58Z) - Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video [12.283639677279645]
We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling.
Results show state-of-the-art performance in dynamic 4D modeling with superior visual quality.
arXiv Detail & Related papers (2025-03-27T17:57:32Z) - 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [116.2042238179433]
In this paper, we frame dynamic scenes as unconstrained 4D volume learning problems.
We represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features.
This approach can capture relevant information in space and time by fitting the underlying photorealistic spatio-temporal volume.
Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, novel views for complex dynamic scenes.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - Bringing Objects to Life: training-free 4D generation from 3D objects through view consistent noise [31.533802484121182]
We introduce a training-free method for animating 3D objects by conditioning on textual prompts to guide 4D generation. We first convert a 3D mesh into a 4D Neural Radiance Field (NeRF) that preserves the object's visual attributes. Then, we animate the object using an Image-to-Video diffusion model driven by text.
arXiv Detail & Related papers (2024-12-29T10:12:01Z) - Phys4DGen: Physics-Compliant 4D Generation with Multi-Material Composition Perception [4.054634170768821]
Phys4DGen is a novel 4D generation framework that integrates multi-material composition perception with physical simulation.
The framework achieves automated, physically plausible 4D generation through three innovative modules.
Experiments on both synthetic and real-world datasets demonstrate that Phys4DGen can generate high-fidelity 4D content with physical realism.
arXiv Detail & Related papers (2024-11-25T12:12:38Z) - EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z) - DreamPhysics: Learning Physics-Based 3D Dynamics with Video Diffusion Priors [75.83647027123119]
We propose to learn the physical properties of a material field with video diffusion priors.
We then utilize a physics-based Material-Point-Method simulator to generate 4D content with realistic motions.
arXiv Detail & Related papers (2024-06-03T16:05:25Z) - PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation [62.53760963292465]
PhysDreamer is a physics-based approach that endows static 3D objects with interactive dynamics.
We present our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study.
arXiv Detail & Related papers (2024-04-19T17:41:05Z) - 3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes [68.66237114509264]
We present a framework capable of learning 3D-grounded visual intuitive physics models from videos of complex scenes with fluids.
We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space.
arXiv Detail & Related papers (2023-04-22T19:28:49Z) - Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
arXiv Detail & Related papers (2023-01-26T18:14:32Z) - Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
arXiv Detail & Related papers (2022-02-24T01:31:29Z) - Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos.
We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues.
Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
arXiv Detail & Related papers (2021-12-03T18:57:47Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - Learning 3D Dynamic Scene Representations for Robot Manipulation [21.6131570689398]
A 3D scene representation for robot manipulation should capture three key object properties: permanency, completeness, and continuity.
We introduce 3D Dynamic Scene Representation (DSR), a 3D scene representation that simultaneously discovers, tracks, and reconstructs objects, and predicts their dynamics.
We propose DSR-Net, which learns to aggregate visual observations over multiple interactions to gradually build and refine DSR.
arXiv Detail & Related papers (2020-11-03T19:23:06Z) - Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z) - 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [27.747241700017728]
We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs.
3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.
arXiv Detail & Related papers (2020-02-15T00:46:32Z)