Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
- URL: http://arxiv.org/abs/2503.21761v1
- Date: Thu, 27 Mar 2025 17:57:32 GMT
- Title: Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
- Authors: David Yifan Yao, Albert J. Zhai, Shenlong Wang
- Abstract summary: We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling. Results show state-of-the-art performance in dynamic 4D modeling with superior visual quality.
- Score: 12.283639677279645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
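The abstract casts camera pose estimation and dense motion tracking as optimization over cues from frozen pretrained models rather than outputs of a newly trained network. As a rough, self-contained illustration of that idea (not Uni4D's actual pipeline), the toy PyTorch sketch below fakes the foundation-model outputs, per-frame depth and 2D tracks of static points, with synthetic data, then recovers per-frame camera translations by requiring back-projected tracks to coincide in world space. The translation-only camera model and all names are assumptions made for brevity.

```python
import torch

torch.manual_seed(0)
F, K = 8, 50                       # frames, tracked static points
fx = fy = 200.0
cx = cy = 128.0                    # pinhole intrinsics, assumed known

# Stand-ins for foundation-model outputs (synthetic, NOT real model calls):
world_pts = torch.randn(K, 3) * 2.0
world_pts[:, 2] = world_pts[:, 2].abs() + 4.0              # keep points in front
gt_trans = torch.cumsum(torch.randn(F, 3) * 0.05, dim=0)   # true camera path
cam_pts = world_pts[None] - gt_trans[:, None]              # (F, K, 3)
depth = cam_pts[..., 2]                                    # "video depth model"
uv = torch.stack([fx * cam_pts[..., 0] / depth + cx,       # "point tracker"
                  fy * cam_pts[..., 1] / depth + cy], dim=-1)

# Optimization stage: unknown per-frame camera translations (rotation is
# fixed to identity here purely to keep the sketch short).
trans = torch.zeros(F, 3, requires_grad=True)
opt = torch.optim.Adam([trans], lr=0.02)
for _ in range(500):
    opt.zero_grad()
    # back-project each 2D track with its depth into world space
    x = (uv[..., 0] - cx) / fx * depth
    y = (uv[..., 1] - cy) / fy * depth
    world = torch.stack([x, y, depth], dim=-1) + trans[:, None]
    # static tracks must coincide across frames (up to a global offset)
    loss = (world - world.mean(dim=0, keepdim=True)).pow(2).mean()
    loss.backward()
    opt.step()

err = (trans - trans[0] - (gt_trans - gt_trans[0])).norm(dim=-1).mean()
print(f"mean camera-translation error: {err.item():.4f}")
```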
Related papers
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos (a toy version of this rendering step is sketched after this list).
arXiv Detail & Related papers (2026-01-08T17:28:52Z)
- DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss [5.644194272935956]
DynaPose4D is a framework that generates high-quality 4D dynamic content from a single static image. Results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation.
arXiv Detail & Related papers (2025-10-26T01:11:13Z)
- 4DNeX: Feed-Forward 4D Generative Modeling Made Easy [51.79072580042173]
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation.
arXiv Detail & Related papers (2025-08-18T17:59:55Z)
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency [49.875459658889355]
Free4D is a tuning-free framework for 4D scene generation from a single image.
Our key insight is to distill pre-trained foundation models for consistent 4D scene representation.
The resulting 4D representation enables real-time, controllable rendering.
arXiv Detail & Related papers (2025-03-26T17:59:44Z)
- CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models [98.03734318657848]
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. We leverage a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks.
arXiv Detail & Related papers (2024-11-27T18:57:16Z)
- SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency [37.96042037188354]
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.
arXiv Detail & Related papers (2024-07-24T17:59:43Z)
- Animate3D: Animating Any 3D Model with Multi-view Video Diffusion [47.05131487114018]
Animate3D is a novel framework for animating any static 3D model.
We introduce a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) to leverage multi-view video diffusion priors for animating 3D objects (a generic score-distillation update is sketched after this list).
arXiv Detail & Related papers (2024-07-16T05:35:57Z)
- Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, Diffusion4D, for efficient and scalable 4D content generation.
We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets.
Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z)
- Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a single 4D representation of the entire scene, Comp4D constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories (see the score-distillation sketch after this list).
arXiv Detail & Related papers (2024-03-25T17:55:52Z)
- Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video [42.10482273572879]
We propose an efficient video-to-4D object generation framework called Efficient4D.
It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data.
Experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed.
arXiv Detail & Related papers (2024-01-16T18:58:36Z)
- Animate124: Animating One Image to 4D Dynamic Scene [108.17635645216214]
Animate124 is the first work to animate a single in-the-wild image into 3D video through textual motion descriptions.
Our method demonstrates significant advancements over existing baselines.
arXiv Detail & Related papers (2023-11-24T16:47:05Z)
- H4D: Human 4D Modeling by Learning Neural Compositional Representation [75.34798886466311]
This work presents a novel framework that can effectively learn a compact and compositional representation for dynamic humans.
A simple yet effective linear motion model is proposed to provide a rough and regularized motion estimation (a toy least-squares version is sketched after this list).
Experiments demonstrate that our method is not only effective in recovering dynamic humans with accurate motion and detailed geometry, but also amenable to various 4D human-related tasks.
arXiv Detail & Related papers (2022-03-02T17:10:49Z)
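VerseCrafter's summary describes rendering a static background point cloud into conditioning signals for a pretrained video diffusion model. As a loose stand-in for that step only (not the paper's pipeline), the NumPy sketch below splats a random point cloud into per-frame depth maps via a z-buffer; the camera model, resolution, and point cloud are all assumptions.

```python
import numpy as np

H, W, fx, fy, cx, cy = 64, 64, 60.0, 60.0, 32.0, 32.0
# random "background" point cloud in front of the camera (synthetic)
pts = np.random.randn(2000, 3) * np.array([1.5, 1.0, 0.5]) + [0.0, 0.0, 4.0]

def depth_conditioning(points, cam_t):
    """Project points seen from a camera at `cam_t` into a depth z-buffer."""
    p = points - cam_t
    z = p[:, 2]
    keep = z > 0.1                                  # drop points behind camera
    u = np.round(fx * p[keep, 0] / z[keep] + cx).astype(int)
    v = np.round(fy * p[keep, 1] / z[keep] + cy).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    buf = np.full((H, W), np.inf)
    np.minimum.at(buf, (v[inside], u[inside]), z[keep][inside])  # nearest wins
    return np.where(np.isinf(buf), 0.0, buf)       # 0 marks empty pixels

# one conditioning map per frame as the (fake) camera slides sideways
cond = [depth_conditioning(pts, np.array([0.05 * f, 0.0, 0.0])) for f in range(8)]
print(np.stack(cond).shape)  # (8, 64, 64)
```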
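Animate3D's 4D-SDS and Comp4D's compositional variant both build on score distillation, in which a frozen diffusion prior supplies gradients for a differentiable renderer's parameters. The sketch below shows only the generic SDS update rule; the one-line "denoiser" is a fake stand-in for a real diffusion model, and all names and the toy prior are invented for illustration.

```python
import torch

def sds_step(render_fn, params, denoiser, opt, sigma=0.5, w=1.0):
    """One generic Score Distillation Sampling update on renderer params."""
    opt.zero_grad()
    img = render_fn(params)                 # differentiable "render"
    noise = torch.randn_like(img)
    noisy = img + sigma * noise             # diffuse the rendering
    with torch.no_grad():
        eps_hat = denoiser(noisy, sigma)    # frozen prior's noise estimate
    # SDS gradient: difference between predicted and injected noise,
    # applied to the image and backpropagated into the parameters.
    img.backward(gradient=w * (eps_hat - noise))
    opt.step()

# toy usage: the "renderer" is a learnable image; the fake prior pulls
# every pixel toward 0.5, so the parameters should drift there.
params = torch.zeros(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.SGD([params], lr=0.1)
denoiser = lambda x, s: (x - 0.5) / s       # fake denoiser, NOT a real model
for _ in range(200):
    sds_step(lambda p: p * 1.0, params, denoiser, opt)
print(params.mean().item())                 # approaches 0.5
```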
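H4D's "simple yet effective linear motion model" suggests a rough, regularized motion estimate that a richer compositional model then refines. As a loose, assumed interpretation (not the paper's method), the NumPy sketch below fits each joint trajectory as an affine function of time by least squares and treats the remainder as the residual left for refinement; the data and dimensions are synthetic.

```python
import numpy as np

T, J = 30, 24                              # frames, body joints
t = np.linspace(0.0, 1.0, T)
# synthetic motion: a linear drift plus a small non-linear wiggle
traj = (t[:, None, None] * np.random.randn(1, J, 3)
        + 0.05 * np.sin(8 * t)[:, None, None] * np.random.randn(1, J, 3))

# least-squares affine fit per joint/coordinate: x(t) ~ a + b * t
A = np.stack([np.ones_like(t), t], axis=1)          # (T, 2) design matrix
coef, *_ = np.linalg.lstsq(A, traj.reshape(T, -1), rcond=None)
rough = (A @ coef).reshape(T, J, 3)                 # regularized estimate

residual = traj - rough                             # left for a refiner
print("rough-fit RMSE:", np.sqrt((residual ** 2).mean()))
```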