Reconstructing 4D Spatial Intelligence: A Survey
- URL: http://arxiv.org/abs/2507.21045v2
- Date: Sun, 03 Aug 2025 14:18:19 GMT
- Title: Reconstructing 4D Spatial Intelligence: A Survey
- Authors: Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowen Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu
- Abstract summary: Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision. We present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence.
- Score: 57.8684548664209
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.
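For readers who want a concrete handle on the hierarchy, the minimal Python sketch below encodes the five levels and the example outputs named in the abstract as a simple data structure. The class and field names are illustrative choices for this digest, not part of the survey itself.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpatialIntelligenceLevel:
    """One level of the survey's five-level hierarchy (focus/outputs paraphrased from the abstract)."""
    level: int
    focus: str
    example_outputs: List[str] = field(default_factory=list)

# Illustrative encoding of the taxonomy; the example outputs come from the abstract's
# parentheticals, while the class and field names are hypothetical.
TAXONOMY = [
    SpatialIntelligenceLevel(1, "low-level 3D attributes", ["depth", "camera pose", "point maps"]),
    SpatialIntelligenceLevel(2, "3D scene components", ["objects", "humans", "structures"]),
    SpatialIntelligenceLevel(3, "4D dynamic scenes", ["time-varying geometry and appearance"]),
    SpatialIntelligenceLevel(4, "interactions among scene components", ["human-object contact", "human-scene interaction"]),
    SpatialIntelligenceLevel(5, "physical laws and constraints", ["dynamics", "collision", "force consistency"]),
]

if __name__ == "__main__":
    for lvl in TAXONOMY:
        print(f"Level {lvl.level}: {lvl.focus} -> {', '.join(lvl.example_outputs)}")
```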
Related papers
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes [16.38713257618971]
Anywhere3D-Bench is a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models.
arXiv Detail & Related papers (2025-06-05T11:28:02Z) - Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos [70.07088203106443]
Existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations. Prior Masked Autoencoder (MAE) frameworks struggle to bridge the gap between low-level geometry and high-level dynamics in 4D data. We propose a novel self-disentangled MAE for learning expressive, discriminative, and transferable 4D representations.
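To make the masked-autoencoding idea referenced here concrete, the following is a minimal, generic sketch of an MAE over a point cloud video: random point tokens are hidden, a mix of visible tokens and learned mask tokens is encoded, and only the masked coordinates contribute to the reconstruction loss. This is not Uni4D's architecture; all module names, sizes, and the random data are assumptions.

```python
import torch
import torch.nn as nn

class TinyPointMAE(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(3, dim)            # per-point embedding of (x, y, z)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, 3)          # reconstruct raw coordinates

    def forward(self, pts, mask_ratio=0.6):
        # pts: (B, T*N, 3) -- a point cloud video flattened over T frames of N points
        B, L, _ = pts.shape
        tokens = self.embed(pts)
        mask = torch.rand(B, L, device=pts.device) < mask_ratio      # True = hidden
        # Replace masked positions with a learned token (simplified; real MAEs usually
        # drop masked tokens from the encoder entirely and only decode them).
        mixed = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, L, -1), tokens)
        recon = self.decoder(self.encoder(mixed))
        return ((recon - pts) ** 2)[mask].mean()   # loss only on the masked points

if __name__ == "__main__":
    model = TinyPointMAE()
    video = torch.randn(2, 4 * 256, 3)            # 2 clips, 4 frames x 256 points each
    print(float(model(video)))
```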
arXiv Detail & Related papers (2025-04-07T08:47:36Z) - WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes [65.76371201992654]
We propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. We also introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks.
arXiv Detail & Related papers (2025-03-17T17:58:18Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
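The toy sketch below illustrates the general pattern this summary describes: each object is optimized separately with a score-distillation-style update, supervised at points along its pre-defined trajectory. The renderer, the frozen denoiser stub, the object names, and every numeric choice are placeholders, not the paper's implementation.

```python
import torch

def render(obj_params, position):
    """Stand-in differentiable renderer: the 'image' is just the parameters shifted to a position."""
    return obj_params + position

def frozen_denoiser(noisy, t):
    """Stub for a pretrained diffusion prior's noise prediction (would be a real model)."""
    return noisy * 0.1

def sds_step(obj_params, trajectory, lr=1e-2):
    """One score-distillation-style update for a single object along its predefined trajectory."""
    for position in trajectory:                    # supervise the object at each trajectory point
        image = render(obj_params, position)
        t = torch.rand(())                         # random diffusion timestep in [0, 1)
        noise = torch.randn_like(image)
        noisy = image + t * noise                  # simplified forward diffusion
        grad = frozen_denoiser(noisy, t) - noise   # SDS-style gradient (no backprop through the prior)
        obj_params = obj_params - lr * grad        # push the rendering toward the prior's preference
    return obj_params

# Hypothetical scene with two objects, each optimized separately along its own trajectory.
objects = {"object_a": torch.zeros(8), "object_b": torch.zeros(8)}
trajectories = {k: [torch.full((8,), s) for s in (0.0, 0.5, 1.0)] for k in objects}
for name, params in objects.items():
    objects[name] = sds_step(params, trajectories[name])
print({k: v.norm().item() for k, v in objects.items()})
```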
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - LoRD: Local 4D Implicit Representation for High-Fidelity Dynamic Human Modeling [69.56581851211841]
We propose a novel Local 4D implicit Representation for Dynamic clothed humans, named LoRD.
Our key insight is to encourage the network to learn the latent codes of local part-level representation.
LoRD has strong capability for representing 4D humans, and outperforms state-of-the-art methods in practical applications.
arXiv Detail & Related papers (2022-08-18T03:49:44Z) - Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos.
We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues.
Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
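A toy reading of that two-module design: a 4D transformation warps a query point at time t into a canonical frame, and a canonical implicit function predicts occupancy from the warped point plus a pixel-aligned feature. The sketch below uses made-up module names, shapes, and randomly faked features purely for illustration; it is not REDO's implementation.

```python
import torch
import torch.nn as nn

class Transform4D(nn.Module):
    """Predicts a per-time offset that warps query points into a shared canonical space."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, xyz, t):
        return xyz + self.net(torch.cat([xyz, t], dim=-1))   # canonical-space point

class CanonicalImplicit(nn.Module):
    """Occupancy of a canonical point, conditioned on a pixel-aligned feature vector."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, xyz_canonical, pixel_feat):
        return self.net(torch.cat([xyz_canonical, pixel_feat], dim=-1))

if __name__ == "__main__":
    warp, implicit = Transform4D(), CanonicalImplicit()
    pts = torch.randn(1024, 3)                    # query points at time t
    t = torch.full((1024, 1), 0.3)                # normalized timestamp
    feats = torch.randn(1024, 32)                 # stand-in for pixel-aligned image features
    occ = implicit(warp(pts, t), feats)           # occupancy probabilities in [0, 1]
    print(occ.shape)
```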
arXiv Detail & Related papers (2021-12-03T18:57:47Z) - 4D Attention: Comprehensive Framework for Spatio-Temporal Gaze Mapping [4.215251065887861]
This study presents a framework for capturing human attention in the spatio-temporal domain using eye-tracking glasses.
We estimate the pose by leveraging a loose coupling of direct visual localization and Inertial Measurement Unit (IMU) values.
By installing reconstruction components into our framework, we instantiate dynamic objects not captured in the 3D environment based on the input textures.
arXiv Detail & Related papers (2021-07-08T04:55:18Z)
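As a rough illustration of loose coupling between visual localization and an IMU, the sketch below dead-reckons a 1-D position from acceleration samples and blends in occasional visual fixes with a constant gain. This is a generic fusion pattern, not the estimator used in the paper; the rates, gain, and 1-D state are simplifying assumptions.

```python
import numpy as np

def fuse_position(imu_accel, visual_fixes, dt=0.01, gain=0.2):
    """imu_accel: per-step acceleration samples; visual_fixes: {step_index: measured position}."""
    pos, vel = 0.0, 0.0
    trace = []
    for i, a in enumerate(imu_accel):
        vel += a * dt                       # dead-reckon with the IMU
        pos += vel * dt
        if i in visual_fixes:               # loose coupling: blend in the visual estimate
            pos += gain * (visual_fixes[i] - pos)
        trace.append(pos)
    return np.array(trace)

# 500 IMU samples with bias-like noise, corrected by a visual fix every 100 steps.
rng = np.random.default_rng(0)
accel = rng.normal(0.0, 0.05, size=500) + 0.02
fixes = {i: 0.0 for i in range(0, 500, 100)}    # pretend the true position stays at 0
print(fuse_position(accel, fixes)[-1])
```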
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including its accuracy) and is not responsible for any consequences arising from its use.