SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
- URL: http://arxiv.org/abs/2603.02133v2
- Date: Tue, 03 Mar 2026 14:00:37 GMT
- Title: SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
- Authors: Chong Xia, Kai Zhu, Zizhuo Wang, Fangfu Liu, Zhizheng Zhang, Yueqi Duan
- Abstract summary: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos. We propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline for cluttered scene reconstruction.
- Score: 32.616029685189744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which makes it natively suited to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline for cluttered scene reconstruction: it first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of the generated assets and physical implausibility of the final scene, a problem that is particularly severe for complex scenes. We therefore propose two bridging modules between the three stages to address this problem. Specifically, for the transition from Perception to Generation, which is critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches 3D space to acquire optimal projected images as conditions for single-object completion. For the transition from Generation to Simulation, which is essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
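The ground-up, support-first assembly that the abstract attributes to the Scene Graph Synthesizer can be illustrated with a minimal sketch. All names below (`SceneObject`, `assembly_order`, the `supported_by` field) are illustrative assumptions rather than SimRecon's actual API; the sketch only shows the constructive ordering idea, i.e. placing each supporting object before the objects that rest on it.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneObject:
    """One reconstructed asset in the scene graph (illustrative)."""
    name: str
    supported_by: Optional[str] = None  # object this one rests on; None = floor

def assembly_order(objects: List[SceneObject]) -> List[str]:
    """Order objects so every support is placed before what rests on it,
    mirroring a constructive, ground-up scene build in a simulator."""
    by_name = {o.name: o for o in objects}
    order: List[str] = []
    seen = set()

    def place(name: str) -> None:
        if name in seen:
            return
        parent = by_name[name].supported_by
        if parent is not None:
            place(parent)  # ensure the supporting object is placed first
        seen.add(name)
        order.append(name)

    for o in objects:
        place(o.name)
    return order

scene = [
    SceneObject("cup", supported_by="table"),
    SceneObject("table"),                      # rests directly on the floor
    SceneObject("book", supported_by="table"),
]
print(assembly_order(scene))  # table is placed before cup and book
```

Ordering placement this way keeps every intermediate scene state physically plausible: no object is ever instantiated floating above a support that does not yet exist.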
Related papers
- Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization [27.083888910311984]
Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Existing methods struggle in cluttered environments. We propose a unified optimization-based formulation for real-to-sim scene estimation.
arXiv Detail & Related papers (2026-02-23T18:58:24Z) - SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge [22.64986854574998]
Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image.
arXiv Detail & Related papers (2025-12-01T12:51:56Z) - HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video [25.898073594115413]
HoloScene is a novel interactive 3D reconstruction framework. It encodes object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints.
arXiv Detail & Related papers (2025-10-07T04:12:18Z) - IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion [15.837932667195037]
IGFuse is a novel framework that reconstructs interactive Gaussian scenes by fusing observations from multiple scans. Our method constructs segmentation-aware Gaussian fields and enforces bi-directional photometric and semantic consistency across scans. IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex pipelines.
arXiv Detail & Related papers (2025-08-18T17:59:47Z) - HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics [60.737929335600015]
We present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization.
arXiv Detail & Related papers (2025-08-13T14:50:19Z) - Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - SimVS: Simulating World Inconsistencies for Robust View Synthesis [102.83898965828621]
We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations.
arXiv Detail & Related papers (2024-12-10T17:35:12Z) - Reconstructing Interactive 3D Scenes by Panoptic Mapping and CAD Model Alignments [81.38641691636847]
We rethink the problem of scene reconstruction from an embodied agent's perspective.
We reconstruct an interactive scene using RGB-D data stream.
The reconstruction replaces the object meshes in the dense panoptic map with part-based articulated CAD models.
arXiv Detail & Related papers (2021-03-30T05:56:58Z) - GeoSim: Photorealistic Image Simulation with Geometry-Aware Composition [81.24107630746508]
We present GeoSim, a geometry-aware image composition process that synthesizes novel urban driving scenes.
We first build a diverse bank of 3D objects with both realistic geometry and appearance from sensor data.
The resulting synthetic images are photorealistic, traffic-aware, and geometrically consistent, allowing image simulation to scale to complex use cases.
arXiv Detail & Related papers (2021-01-16T23:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.