Simulating the Real World: A Unified Survey of Multimodal Generative Models
- URL: http://arxiv.org/abs/2503.04641v1
- Date: Thu, 06 Mar 2025 17:31:43 GMT
- Title: Simulating the Real World: A Unified Survey of Multimodal Generative Models
- Authors: Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
- Abstract summary: We present a unified survey for multimodal generative models that investigates the progression of data dimensionality in real-world simulation. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework.
- Score: 48.35284571052435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this work, we present a unified survey for multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
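The survey's organizing axis, the progression of data dimensionality, can be made concrete with a small sketch. The names below are illustrative rather than taken from the paper; Python is used throughout for examples.

```python
# Illustrative sketch of the survey's taxonomy: each generation setting
# is characterized by which aspects of the real world it models.
# (Class and field names are hypothetical; the paper defines this in prose.)
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationSetting:
    name: str
    appearance: bool   # 2D visual content
    dynamics: bool     # temporal change (video, motion)
    geometry: bool     # explicit 3D structure

SETTINGS = [
    GenerationSetting("2D generation",    appearance=True, dynamics=False, geometry=False),
    GenerationSetting("Video generation", appearance=True, dynamics=True,  geometry=False),
    GenerationSetting("3D generation",    appearance=True, dynamics=False, geometry=True),
    GenerationSetting("4D generation",    appearance=True, dynamics=True,  geometry=True),
]

for s in SETTINGS:
    dims = [d for d, on in [("appearance", s.appearance),
                            ("dynamics", s.dynamics),
                            ("geometry", s.geometry)] if on]
    print(f"{s.name}: {' + '.join(dims)}")
```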
Related papers
- Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets.
We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
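For context, a 6-DoF object pose combines three rotational and three translational degrees of freedom. A minimal, hypothetical sketch of that representation (not the paper's category-level estimator):

```python
import numpy as np

# A 6-DoF pose = 3 rotational + 3 translational degrees of freedom,
# commonly stored as a rotation matrix R and a translation vector t.
# (Hypothetical helper; the paper's category-level method is more involved.)
class Pose6DoF:
    def __init__(self, R: np.ndarray, t: np.ndarray):
        assert R.shape == (3, 3) and t.shape == (3,)
        self.R, self.t = R, t

    def transform(self, points: np.ndarray) -> np.ndarray:
        """Map object-frame points (N, 3) into the camera frame."""
        return points @ self.R.T + self.t

# Identity pose: the object frame coincides with the camera frame.
pose = Pose6DoF(np.eye(3), np.zeros(3))
print(pose.transform(np.array([[1.0, 0.0, 0.0]])))
```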
arXiv Detail & Related papers (2025-03-25T23:55:47Z)
- Advances in 4D Generation: A Survey [20.285058992203442]
4D generation focuses on creating dynamic 3D assets with consistency based on user input.
We summarize five major challenges of 4D generation: consistency, controllability, diversity, efficiency, and fidelity.
We provide an in-depth discussion of the obstacles currently hindering the development of 4D generation.
arXiv Detail & Related papers (2025-03-18T17:59:51Z)
- Generative Physical AI in Vision: A Survey [78.07014292304373]
Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication.
This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D/4D content.
As generative models evolve to increasingly integrate physical realism and dynamic simulation, their potential to function as "world simulators" expands.
arXiv Detail & Related papers (2025-01-19T03:19:47Z)
- Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving.
This paper investigates the relationship between these two technologies.
By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z)
- Diffusion Models in 3D Vision: A Survey [11.116658321394755]
We review the state-of-the-art approaches that leverage diffusion models for 3D visual tasks.
These approaches include 3D object generation, shape completion, point cloud reconstruction, and scene understanding.
We discuss potential solutions, including improving computational efficiency, enhancing multimodal fusion, and exploring the use of large-scale pretraining.
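As background for this entry: diffusion models generate data by iteratively denoising Gaussian noise. A minimal sketch of a DDPM-style reverse step (in the spirit of Ho et al., 2020), with a placeholder standing in for the trained noise-prediction network:

```python
import numpy as np

# Minimal DDPM-style reverse (denoising) step.
# `predict_noise` is a placeholder for a trained network eps_theta(x_t, t).
def predict_noise(x_t, t):
    return np.zeros_like(x_t)  # stand-in; a real model is learned

def reverse_step(x_t, t, betas, rng):
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])
    eps = predict_noise(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z  # sigma_t^2 = beta_t variance choice

# Sampling loop: start from pure noise and denoise step by step.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x = rng.standard_normal((3,))  # e.g., a single 3D point as a toy "sample"
for t in reversed(range(len(betas))):
    x = reverse_step(x, t, betas, rng)
```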
arXiv Detail & Related papers (2024-10-07T04:12:23Z)
- 3D-VLA: A 3D Vision-Language-Action Generative World Model [68.0388311799959]
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world.
We propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action.
Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments.
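The perception-reasoning-action coupling described here can be pictured as a simple control loop; the interface below is a hypothetical schematic, not 3D-VLA's actual API:

```python
from typing import Protocol, Any

# Hypothetical schematic of an embodied foundation model that links
# 3D perception, reasoning, and action (not the 3D-VLA codebase).
class EmbodiedModel(Protocol):
    def perceive(self, observation: Any) -> Any:
        """Encode raw sensor input (e.g., RGB-D) into a 3D scene state."""
    def reason(self, scene: Any, instruction: str) -> Any:
        """Produce a plan, possibly via generated goal images or point clouds."""
    def act(self, plan: Any) -> Any:
        """Decode the plan into low-level robot actions."""

def control_loop(model: EmbodiedModel, env, instruction: str, steps: int = 10):
    obs = env.reset()
    for _ in range(steps):
        scene = model.perceive(obs)
        plan = model.reason(scene, instruction)
        obs = env.step(model.act(plan))
```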
arXiv Detail & Related papers (2024-03-14T17:58:41Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Towards AI-Architecture Liberty: A Comprehensive Survey on Design and Generation of Virtual Architecture by Deep Learning [23.58793497403681]
3D shape generation techniques leveraging deep learning have garnered significant interest from both the computer vision and architectural design communities.
We review 149 related articles covering architectural design, 3D shape techniques, and virtual environments.
We highlight four important enablers of ubiquitous interaction with immersive systems in deep learning-assisted architectural generation.
arXiv Detail & Related papers (2023-04-30T15:38:36Z)
- Deep Generative Models on 3D Representations: A Survey [81.73385191402419]
Generative models aim to learn the distribution of observed data by generating new instances.
Recently, researchers have started to shift focus from 2D to 3D space. However, representing 3D data poses significantly greater challenges.
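The opening claim of this entry, learning the data distribution and then sampling new instances, is easiest to see in the simplest possible generative model; a toy example, purely for illustration:

```python
import numpy as np

# Toy illustration of "learning the distribution of observed data by
# generating new instances": fit a Gaussian to observed 3D points,
# then sample fresh points from the fitted model.
rng = np.random.default_rng(0)
observed = rng.normal(loc=[0.0, 1.0, 2.0], scale=0.5, size=(1000, 3))

mu = observed.mean(axis=0)            # estimated mean
cov = np.cov(observed, rowvar=False)  # estimated covariance

new_instances = rng.multivariate_normal(mu, cov, size=5)
print(new_instances)  # samples drawn from the learned distribution
```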
arXiv Detail & Related papers (2022-10-27T17:59:50Z)
- Recovering 3D Human Mesh from Monocular Images: A Survey [49.00136388529404]
Estimating human pose and shape from monocular images is a long-standing problem in computer vision.
This survey focuses on the task of monocular 3D human mesh recovery.
arXiv Detail & Related papers (2022-03-03T18:56:08Z)
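Most approaches surveyed in the last entry regress a low-dimensional parametric body model rather than raw mesh vertices. A hedged, SMPL-style sketch of that parameterization (the zero-filled basis is a stand-in for learned data):

```python
import numpy as np

# Hedged sketch of an SMPL-style parametric body model: the mesh is a
# function of low-dimensional shape (beta) and pose (theta) parameters.
# The zero-filled basis below is illustrative; real models learn it and
# additionally apply pose blendshapes and linear blend skinning.
N_VERTS, N_SHAPE, N_POSE = 6890, 10, 72   # sizes used by SMPL

template = np.zeros((N_VERTS, 3))               # mean body mesh
shape_basis = np.zeros((N_SHAPE, N_VERTS, 3))   # shape blendshape directions

def body_mesh(beta: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Toy stand-in: theta would drive skeletal articulation in a real model."""
    assert beta.shape == (N_SHAPE,) and theta.shape == (N_POSE,)
    return template + np.einsum("s,svc->vc", beta, shape_basis)

verts = body_mesh(np.zeros(N_SHAPE), np.zeros(N_POSE))
print(verts.shape)  # (6890, 3): one 3D position per mesh vertex
```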