Bridging the Gap Between Multimodal Foundation Models and World Models
- URL: http://arxiv.org/abs/2510.03727v1
- Date: Sat, 04 Oct 2025 08:14:20 GMT
- Title: Bridging the Gap Between Multimodal Foundation Models and World Models
- Authors: Xuehai He,
- Abstract summary: We investigate what it takes to bridge the gap between multimodal foundation models and world models.<n>Our approaches incorporate scene graphs, multimodal conditioning and alignment strategies to guide the generation process.<n>We extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
- Score: 10.001347956177879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans understand the world through the integration of multiple sensory modalities, enabling them to perceive, reason about, and imagine dynamic physical processes. Inspired by this capability, multimodal foundation models (MFMs) have emerged as powerful tools for multimodal understanding and generation. However, today's MFMs fall short of serving as effective world models. They lack the essential ability such as perform counterfactual reasoning, simulate dynamics, understand the spatiotemporal information, control generated visual outcomes, and perform multifaceted reasoning. We investigates what it takes to bridge the gap between multimodal foundation models and world models. We begin by improving the reasoning capabilities of MFMs through discriminative tasks and equipping MFMs with structured reasoning skills, such as causal inference, counterfactual thinking, and spatiotemporal reasoning, enabling them to go beyond surface correlations and understand deeper relationships within visual and textual data. Next, we explore generative capabilities of multimodal foundation models across both image and video modalities, introducing new frameworks for structured and controllable generation. Our approaches incorporate scene graphs, multimodal conditioning, and multimodal alignment strategies to guide the generation process, ensuring consistency with high-level semantics and fine-grained user intent. We further extend these techniques to controllable 4D generation, enabling interactive, editable, and morphable object synthesis over time and space.
Related papers
- Graph4MM: Weaving Multimodal Learning with Structural Information [52.16646463590474]
Graphs provide powerful structural information for modeling intra- and inter-modal relationships.<n>Previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality.<n>We propose Graph4MM, a graph-based multimodal learning framework.
arXiv Detail & Related papers (2025-10-19T20:13:03Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms.<n>Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks.<n>To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - A Survey of Generative Categories and Techniques in Multimodal Large Language Models [3.7507324448128876]
Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation.<n>This survey categorises six primary generative modalities and examines how foundational techniques enable cross-modal capabilities.
arXiv Detail & Related papers (2025-05-29T12:29:39Z) - A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research.<n> multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks.<n>This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z) - Multi-modal Generative AI: Multi-modal LLMs, Diffusions and the Unification [41.88402339122694]
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry.<n>This paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation.
arXiv Detail & Related papers (2024-09-23T13:16:09Z) - Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Foundation Models for Decision Making: Problems, Methods, and
Opportunities [124.79381732197649]
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks.
New paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning.
Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems.
arXiv Detail & Related papers (2023-03-07T18:44:07Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.