Can World Models Benefit VLMs for World Dynamics?
- URL: http://arxiv.org/abs/2510.00855v1
- Date: Wed, 01 Oct 2025 13:07:05 GMT
- Title: Can World Models Benefit VLMs for World Dynamics?
- Authors: Kevin Zhang, Kuangzhi Ge, Xiaowei Chi, Renrui Zhang, Shaojun Shi, Zhen Dong, Sirui Han, Shanghang Zhang
- Abstract summary: We investigate what capabilities emerge when world model priors are transferred into Vision-Language Models. We name our best-performing variant the Dynamic Vision Aligner (DyVA). We find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
- Score: 59.73433292793044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundation models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we investigate the capabilities that emerge when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder that performs a single denoising step, and we treat the resulting latents as a set of visual embeddings. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders capture latents that are useful for downstream understanding and distinct from those produced by conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to the motion consistency that WorldLMs internalize from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.
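The generative-encoder recipe described in the abstract (one denoising step over the input frames, with the intermediate latents reused as visual tokens) can be illustrated with a minimal sketch. The code below is a simplified, hypothetical illustration rather than the authors' released implementation: `ToyVideoDiffusionUNet`, the fixed mixing coefficient standing in for a real noise schedule, and the linear projector are all placeholder assumptions.

```python
# Minimal conceptual sketch (not the authors' code): use a video diffusion
# model as a "generative encoder" -- run a single denoising step and treat
# the resulting latents as visual embeddings for a language model.
import torch
import torch.nn as nn


class ToyVideoDiffusionUNet(nn.Module):
    """Stand-in for a pre-trained video diffusion backbone (hypothetical)."""

    def __init__(self, channels: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, channels, 3, padding=1),
        )

    def forward(self, noisy_latents: torch.Tensor, timestep: torch.Tensor) -> torch.Tensor:
        # A real backbone would also condition on the timestep; omitted for brevity.
        return self.net(noisy_latents)


def encode_with_one_denoising_step(
    unet: nn.Module,
    clean_latents: torch.Tensor,   # (B, C, T, H, W) latents of the input frames
    timestep: int = 500,
) -> torch.Tensor:
    """Noise the latents, run a single denoising pass, and flatten the result
    into a sequence of visual tokens."""
    noise = torch.randn_like(clean_latents)
    # Toy noise mixing: a single fixed coefficient instead of a full DDPM schedule.
    alpha = 0.5
    noisy = alpha * clean_latents + (1 - alpha) * noise
    with torch.no_grad():
        denoised = unet(noisy, torch.tensor([timestep]))
    b, c, t, h, w = denoised.shape
    # Treat each spatio-temporal position as one visual token of dimension C.
    return denoised.permute(0, 2, 3, 4, 1).reshape(b, t * h * w, c)


# A small projector maps diffusion latents into the LLM embedding space,
# analogous to the aligner module a WorldLM would train.
projector = nn.Linear(4, 1024)
unet = ToyVideoDiffusionUNet()
latents = torch.randn(1, 4, 8, 16, 16)             # 8 frames of 16x16 latents
visual_tokens = projector(encode_with_one_denoising_step(unet, latents))
print(visual_tokens.shape)                          # torch.Size([1, 2048, 1024])
```

In a WorldLM such as DyVA, the projected tokens would presumably be consumed by the language model alongside text embeddings; the exact aligner design and training recipe are the subject of the paper's ablations, not this sketch.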
Related papers
- ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models [27.729654985554372]
ReWorld is a framework that employs reinforcement learning to align video-based embodied world models with physical realism, task-completion capability, embodiment plausibility, and visual quality. We show that ReWorld significantly improves the physical fidelity, logical coherence, embodiment, and visual quality of generated rollouts, outperforming previous methods.
arXiv Detail & Related papers (2026-01-18T14:27:10Z) - Continual Learning for Generative AI: From LLMs to MLLMs and Beyond [56.29231194002407]
We present a comprehensive survey of continual learning methods for mainstream generative AI models. We categorize these approaches into three paradigms: architecture-based, regularization-based, and replay-based. We analyze continual learning setups for different generative models, including training objectives, benchmarks, and core backbones.
arXiv Detail & Related papers (2025-06-16T02:27:25Z) - From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models [65.0487600936788]
Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality content. We argue that VDMs are natural candidates for probing structured representations and an implicit understanding of the visual world. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input sequences.
arXiv Detail & Related papers (2025-06-08T20:52:34Z) - Vid2World: Crafting Video Diffusion Models to Interactive World Models [35.42362065437052]
We present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. Our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
arXiv Detail & Related papers (2025-05-20T13:41:45Z) - EVA: An Embodied World Model for Future Video Anticipation [30.721105710709008]
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. Existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle out-of-distribution (OOD) scenarios. We propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction.
arXiv Detail & Related papers (2024-10-20T18:24:00Z) - iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)