Related papers: Mirage2Matter: A Physically Grounded Gaussian World Model from Video

Mirage2Matter: A Physically Grounded Gaussian World Model from Video

URL: http://arxiv.org/abs/2602.00096v1
Date: Sat, 24 Jan 2026 07:43:57 GMT
Title: Mirage2Matter: A Physically Grounded Gaussian World Model from Video
Authors: Zhengqing Gao, Ziwen Li, Xin Wang, Jiaxin Huang, Zhenyang Ren, Mingkai Shao, Hanlue Zhang, Tianyu Huang, Yongkang Cheng, Yandong Guo, Runqi Lin, Yuanyuan Wang, Tongliang Liu, Kun Zhang, Mingming Gong,
Abstract summary: We present Simulate Anything, a graphics-driven world modeling and simulation framework.<n>Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS)<n>We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
Score: 87.9732484393686
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.

Related papers

D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping [66.22412592525369]
We introduce a real-to-sim-to-real engine that leverages the Gaussian Splat representations to build a differentiable engine.<n>We show that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values.<n>Those optimized mass values facilitate force-aware policy learning, achieving superior and high performance in object grasping.
arXiv Detail & Related papers (2026-03-01T15:32:04Z)
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer [62.18680935878919]
We introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings into temporally consistent outputs.<n>At its core is a single-step temporally-conditioned enhancer capable of running in online simulators on a single GPU.
arXiv Detail & Related papers (2026-02-27T15:35:30Z)
HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video [25.898073594115413]
HoloScene is a novel interactive 3D reconstruction framework.<n>It encodes object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships.<n>The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints.
arXiv Detail & Related papers (2025-10-07T04:12:18Z)
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device [33.22697339175522]
Embodied AI predominantly relies on simulation for training and evaluation.<n>Sim-to-real transfer remains a major challenge.<n>EmbodiedSplat is a novel approach that personalizes policy training.
arXiv Detail & Related papers (2025-09-22T07:22:31Z)
GWM: Towards Scalable Gaussian World Models for Robotic Manipulation [53.51622803589185]
We propose a novel branch of world model named Gaussian World Model (GWM) for robotic manipulation.<n>At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine-grained scene-level future state reconstruction.<n>Both simulated and real-world experiments depict that GWM can precisely predict future scenes conditioned on diverse robot actions.
arXiv Detail & Related papers (2025-08-25T02:01:09Z)
DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos [52.46386528202226]
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM)<n>It is the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene.<n>It achieves performance on par with state-of-the-art monocular video 3D tracking methods.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation [78.26308457952636]
This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome limitations in autonomous driving simulation.<n>It enables realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time.<n>We show that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer.
arXiv Detail & Related papers (2025-06-09T14:50:19Z)
Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.<n>To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.<n> Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images [39.0780707100513]
We present an integrated end-to-end pipeline that generates simulation scenes complete with articulated kinematic and dynamic structures from real-world images. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies.
arXiv Detail & Related papers (2024-05-19T20:01:29Z)
Reconstructing Objects in-the-wild for Realistic Sensor Simulation [41.55571880832957]
We present NeuSim, a novel approach that estimates accurate geometry and realistic appearance from sparse in-the-wild data. We model the object appearance with a robust physics-inspired reflectance representation effective for in-the-wild data. Our experiments show that NeuSim has strong view synthesis performance on challenging scenarios with sparse training views.
arXiv Detail & Related papers (2023-11-09T18:58:22Z)
Learning from synthetic data generated with GRADE [0.6982738885923204]
We present a framework for generating realistic animated dynamic environments (GRADE) for robotics research. GRADE supports full simulation control, ROS integration, realistic physics, while being in an engine that produces high visual fidelity images and ground truth data. We show that, even training using only synthetic data, can generalize well to real-world images in the same application domain.
arXiv Detail & Related papers (2023-05-07T14:13:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.