Related papers: DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

URL: http://arxiv.org/abs/2512.03000v2
Date: Wed, 03 Dec 2025 18:51:37 GMT
Title: DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Authors: Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan,
Abstract summary: We introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video.<n>We employ vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic captions.<n> DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos.
Score: 67.95038177144554
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structurefrom-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

Related papers

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics.<n>Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud.<n>These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos.
arXiv Detail & Related papers (2026-01-08T17:28:52Z)
TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model [53.555353366322464]
We present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system.<n>Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible synthesis systems.
arXiv Detail & Related papers (2025-12-31T18:31:46Z)
PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model [23.768571323272152]
PartRM is a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object.<n>We introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states.<n> Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics.
arXiv Detail & Related papers (2025-03-25T17:59:58Z)
4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [115.67081491747943]
Dynamic 3D scene representation and novel view synthesis are crucial for enabling AR/VR and metaverse applications.<n>We reformulate the reconstruction of a time-varying 3D scene as approximating its underlying 4D volume.<n>We derive several compact variants that effectively reduce the memory footprint to address its storage bottleneck.
arXiv Detail & Related papers (2024-12-30T05:30:26Z)
PhysFlow: Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation [9.306758077479472]
PhysFlow is a novel approach that leverages multi-modal foundation models and video diffusion to achieve enhanced 4D dynamic scene simulation.<n>This integrated framework enables accurate prediction and realistic simulation of dynamic interactions in real-world scenarios.
arXiv Detail & Related papers (2024-11-21T18:55:23Z)
DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
Real-time Deep Dynamic Characters [95.5592405831368]
We propose a deep videorealistic 3D human character model displaying highly realistic shape, motion, and dynamic appearance. We use a novel graph convolutional network architecture to enable motion-dependent deformation learning of body and clothing. We show that our model creates motion-dependent surface deformations, physically plausible dynamic clothing deformations, as well as video-realistic surface textures at a much higher level of detail than previous state of the art approaches.
arXiv Detail & Related papers (2021-05-04T23:28:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.