VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation
- URL: http://arxiv.org/abs/2512.11061v1
- Date: Thu, 11 Dec 2025 19:21:47 GMT
- Title: VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation
- Authors: Felix O'Mahony, Roberto Cipolla, Ayush Tewari
- Abstract summary: We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools. It can then infer latent dynamics from the static scene to predict plausible future states.
- Score: 23.86958300272144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.
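To make the abstract's pipeline concrete, here is a minimal Python sketch of the loop it describes (tool selection, simulator selection, dynamics inference, rollout). Every name in it — `query_vlm`, `VISION_TOOLS`, `SIMULATORS`, `infer_latent_dynamics`, the `Scene` dataclass — is a hypothetical stand-in for illustration, not VDAWorld's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Abstract 2D/3D scene state: grounded entities plus inferred dynamics."""
    entities: list = field(default_factory=list)
    dynamics: dict = field(default_factory=dict)

# Stub tool suite and simulator registry; a real system would wrap actual
# segmentation / depth estimators and rigid-body / fluid engines here.
VISION_TOOLS = {
    "segment": lambda img: [{"kind": "object"}],
    "depth":   lambda img: [{"kind": "depth_map"}],
}

class StubSim:
    def step(self, scene: Scene) -> Scene:
        # A real engine would integrate physics; we copy the state forward.
        return Scene(entities=list(scene.entities), dynamics=dict(scene.dynamics))

SIMULATORS = {"rigid_body": StubSim(), "fluid": StubSim()}

def query_vlm(prompt: str, image) -> str:
    """Placeholder for a VLM call; returns canned choices for this sketch."""
    return "segment,depth" if "tools" in prompt else "rigid_body"

def infer_latent_dynamics(scene: Scene, caption: str) -> dict:
    """Placeholder: infer velocities/forces from the static scene + caption."""
    return {"gravity": -9.81}

def world_model(image, caption: str, horizon: int = 3) -> list[Scene]:
    # 1. The VLM selects vision tools and composes a grounded scene.
    scene = Scene()
    for name in query_vlm(f"Which tools ground: {caption}?", image).split(","):
        scene.entities += VISION_TOOLS[name.strip()](image)
    # 2. The VLM chooses a physics simulator compatible with the scene.
    sim = SIMULATORS[query_vlm("Which simulator fits this scene?", image)]
    # 3. Infer latent dynamics from the static scene, then roll forward.
    scene.dynamics = infer_latent_dynamics(scene, caption)
    states = [scene]
    for _ in range(horizon):
        states.append(sim.step(states[-1]))
    return states

future_states = world_model(image=None, caption="a ball resting on a ramp")
```

The design point the abstract emphasizes is that the VLM makes both choices (which tools, which simulator), so the simulation backend can change per scene rather than being fixed up front.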
Related papers
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models [60.80050275581661]
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities, yet they lack a grounded understanding of physical dynamics. We present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks.
arXiv Detail & Related papers (2025-12-05T18:51:03Z)
- Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation [41.993197533574126]
Inferix is an inference engine that enables immersive world synthesis through optimized semi-autoregressive decoding. It further offers interactive video streaming and profiling, enabling real-time interaction and realistic simulation.
arXiv Detail & Related papers (2025-11-25T01:45:04Z)
- PAN: A World Model for General, Interactable, and Long-Horizon World Simulation [49.805071498152536]
We introduce PAN, a general, interactable, and long-horizon world model. It predicts future world states through high-quality video simulation conditioned on history and natural-language actions. Experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning.
arXiv Detail & Related papers (2025-11-12T07:20:35Z)
- Simulating the Visual World with Artificial Intelligence: A Roadmap [48.64639618440864]
Video generation is shifting from producing visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. This survey provides a systematic overview of that evolution, conceptualizing modern video foundation models as the combination of two core components. We trace the progression of video generation through four generations, culminating in a video generation model that embodies intrinsic physical plausibility.
arXiv Detail & Related papers (2025-11-11T18:59:50Z)
- Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate the capabilities that emerge when world-model priors are transferred into Vision-Language Models. We name our best-performing variant Dynamic Vision Aligner (DyVA). We find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be applied to both diffusion and autoregressive transformer models. (A hedged sketch of such an action-conditioning adapter appears after this list.)
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- EVA: An Embodied World Model for Future Video Anticipation [30.721105710709008]
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. Existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. We propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction.
arXiv Detail & Related papers (2024-10-20T18:24:00Z)
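The DWS entry above describes a "lightweight, universal action-conditioned module," but its abstract gives no architecture. Below is a hedged PyTorch sketch of one plausible form of such an adapter: per-frame action embeddings added onto the hidden states of a frozen video backbone. The module shape, injection point, tensor layout, and the name `ActionConditioner` are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ActionConditioner(nn.Module):
    """Hypothetical adapter: maps per-frame discrete actions to features added
    onto a frozen video backbone's hidden states; only this module trains."""
    def __init__(self, num_actions: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(num_actions, hidden_dim)
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, hidden: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, frames, tokens, dim); actions: (batch, frames)
        cond = self.proj(self.embed(actions))   # (batch, frames, dim)
        return hidden + cond.unsqueeze(2)       # broadcast over spatial tokens

# Usage with dummy shapes: 2 clips, 8 frames, 64 spatial tokens, dim 256.
h = torch.randn(2, 8, 64, 256)
a = torch.randint(0, 4, (2, 8))
out = ActionConditioner(num_actions=4, hidden_dim=256)(h, a)
assert out.shape == h.shape
```

Additive per-frame conditioning is only one option; cross-attention or FiLM-style modulation would fit the same interface, which is compatible with the paper's claim that the module works for both diffusion and autoregressive transformer backbones.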