Inferring Compositional 4D Scenes without Ever Seeing One
- URL: http://arxiv.org/abs/2512.05272v1
- Date: Thu, 04 Dec 2025 21:51:47 GMT
- Title: Inferring Compositional 4D Scenes without Ever Seeing One
- Authors: Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel
- Abstract summary: We propose COM4D, a method that consistently predicts the structure and spatio-temporal configuration of 4D/3D objects. We achieve this through a carefully designed training of spatial and temporal attentions on 2D video input. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and composed scenes.
- Score: 58.81854043690171
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structure, composition, and spatio-temporal configuration in the wild, though extremely interesting, is equally hard. Existing works therefore often focus on one object at a time, while relying on a category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single-object supervision. We achieve this through a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single-object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, despite being purely data-driven, COM4D achieves state-of-the-art results on the separate existing problems of 4D object reconstruction and composed 3D reconstruction.
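The alternation described above lends itself to an axial-attention reading: spatial attention relates tokens within a frame (object composition), while temporal attention relates each token to itself across frames (single-object dynamics), and inference interleaves the two. The PyTorch sketch below illustrates only this reading; the block structure, tensor shapes, and names are assumptions, not COM4D's published implementation.

```python
# Minimal sketch (not the authors' code): alternating spatial and temporal
# self-attention over a token grid of shape (batch, time, tokens, dim).
# All module and parameter names here are illustrative assumptions.
import torch
import torch.nn as nn

class AxialAttentionBlock(nn.Module):
    """One spatial pass plus one temporal pass over spatio-temporal tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- B videos, T frames, N tokens per frame, D channels.
        B, T, N, D = x.shape

        # Spatial attention: tokens attend within each frame (composition).
        s = self.norm1(x).reshape(B * T, N, D)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, N, D)

        # Temporal attention: each token attends to itself across frames (dynamics).
        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x

tokens = torch.randn(1, 8, 256, 512)  # one video, 8 frames, 256 tokens/frame
print(AxialAttentionBlock(dim=512)(tokens).shape)  # torch.Size([1, 8, 256, 512])
```

Under this reading, the disentangled training the abstract describes would amount to updating the spatial path on static multi-object data and the temporal path on dynamic single-object videos, with the inference-time attention mixing corresponding to interleaving both passes, as in the forward method above.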
Related papers
- 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction [28.50411933478524]
We present a dynamic reconstruction system that receives a casual monocular RGB video as input and outputs a persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps.
arXiv Detail & Related papers (2025-12-18T14:06:15Z) - SS4D: Native 4D Generative Model via Structured Spacetime Latents [50.29500511908054]
We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. We train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency.
arXiv Detail & Related papers (2025-12-16T10:45:06Z) - C4D: 4D Made from 3D through Dual Correspondences [77.04731692213663]
We introduce C4D, a framework that leverages temporal correspondences to extend existing 3D reconstruction formulations to 4D. C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information.
arXiv Detail & Related papers (2025-10-16T17:59:06Z) - Understanding Dynamic Scenes in Ego Centric 4D Point Clouds [7.004204907286336]
EgoDynamic4D is a novel QA benchmark on highly dynamic scenes. We design 12 dynamic QA tasks covering agent motion, human-object interaction, relation understanding, trajectory prediction, and temporal-causal reasoning, with fine-grained, explicit metrics. Our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for ego-centric dynamic scene understanding.
arXiv Detail & Related papers (2025-08-10T09:08:04Z) - LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding [55.81291976637705]
We propose a general LMM framework with a spatiotemporal prompt for visual representation in 4D scene understanding. The prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding (a generic sketch of such an embedding follows this list). Experiments demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
arXiv Detail & Related papers (2025-05-18T06:18:57Z) - Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image. Our key insight is to distill pre-trained foundation models for consistent 4D scene representation. The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos.
We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues.
Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
arXiv Detail & Related papers (2021-12-03T18:57:47Z)
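As referenced in the LLaVA-4D entry above, a 4D coordinate embedding can be pictured as lifting 3D position plus 1D time into Fourier features. The sketch below is a generic illustration of that idea under assumed names and encoding choices; it is not the paper's exact design.

```python
# Minimal sketch of a 4D coordinate embedding: 3D position and 1D time are
# lifted into sinusoidal (Fourier) features. Function names and the encoding
# are illustrative assumptions, not LLaVA-4D's exact formulation.
import torch

def fourier_features(x: torch.Tensor, num_bands: int = 8) -> torch.Tensor:
    # x: (..., C) coordinates -> (..., C * 2 * num_bands) sin/cos features.
    freqs = 2.0 ** torch.arange(num_bands, dtype=x.dtype)    # 1, 2, 4, ...
    angles = x.unsqueeze(-1) * freqs                         # (..., C, num_bands)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., C, 2*num_bands)
    return feats.flatten(-2)

def embed_4d(xyz: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # xyz: (N, 3) positions, t: (N, 1) timestamps -> (N, 64) embeddings.
    return fourier_features(torch.cat([xyz, t], dim=-1))

points = torch.rand(5, 3)   # 5 points in a unit cube
times = torch.rand(5, 1)    # normalized timestamps
print(embed_4d(points, times).shape)  # torch.Size([5, 64])
```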