EVA: An Embodied World Model for Future Video Anticipation
- URL: http://arxiv.org/abs/2410.15461v2
- Date: Tue, 10 Jun 2025 08:08:33 GMT
- Title: EVA: An Embodied World Model for Future Video Anticipation
- Authors: Xiaowei Chi, Chun-Kai Fan, Hengyuan Zhang, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi-min Chan, Wei Xue, Qifeng Liu, Shanghang Zhang, Yike Guo
- Abstract summary: Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. Existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. We propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction.
- Score: 30.721105710709008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. To address this challenge, we propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. It leverages the complementary strengths of pre-trained vision-language and video generation models, enabling them to function as a world model in embodied scenarios. To support RoG, we introduce the Embodied Video Anticipation Benchmark (EVA-Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios, utilizing both in-domain and OOD datasets. Building on this foundation, we devise a world model, Embodied Video Anticipator (EVA), that follows a multistage training paradigm to generate high-fidelity video frames and applies an autoregressive strategy to enable adaptive generalization for longer video sequences. Extensive experiments demonstrate the efficacy of EVA in various downstream tasks like video generation and robotics, thereby paving the way for large-scale pre-trained models in real-world video prediction applications. The video demos are available at https://sites.google.com/view/icml-eva.
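For readers skimming this listing, the sketch below illustrates the general pattern the abstract describes: a video is extended autoregressively in chunks while a critic scores candidate continuations (the reflection step). All function names, tensor shapes, and the scoring rule are hypothetical stand-ins for illustration only, not the authors' actual models or API.

```python
# Minimal sketch of an autoregressive anticipation loop with a reflection step,
# in the spirit of the abstract. generate_clip and reflect are hypothetical stubs.
import numpy as np

def generate_clip(context_frames, instruction, rng):
    """Stand-in for a video generation model: predicts the next chunk of frames."""
    t, h, w, c = 8, 64, 64, 3
    noise = rng.normal(scale=0.05, size=(t, h, w, c))
    return np.clip(context_frames[-1] + noise, 0.0, 1.0)

def reflect(clip, instruction):
    """Stand-in for a vision-language critic: scores how well the clip fits the task."""
    return float(clip.mean())  # placeholder score

def anticipate(initial_frames, instruction, num_chunks=4, num_candidates=3, seed=0):
    """Autoregressively extend a video, keeping the candidate the critic prefers."""
    rng = np.random.default_rng(seed)
    video = list(initial_frames)
    for _ in range(num_chunks):
        candidates = [generate_clip(np.asarray(video), instruction, rng)
                      for _ in range(num_candidates)]
        best = max(candidates, key=lambda clip: reflect(clip, instruction))
        video.extend(best)  # feed the accepted chunk back in as context
    return np.asarray(video)

frames = np.zeros((4, 64, 64, 3)) + 0.5           # dummy initial observation
future = anticipate(frames, "pick up the red cube")
print(future.shape)                                # (4 + 4*8, 64, 64, 3)
```

In the paper's actual pipeline the generator is a pre-trained video generation model and the critic role is played by a vision-language model; the stubs above only preserve the rollout-and-critique control flow described in the abstract.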
Related papers
- From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models [65.0487600936788]
Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality content. We argue that VDMs are naturally driven to probe structured representations and an implicit understanding of the visual world. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input sequences.
arXiv Detail & Related papers (2025-06-08T20:52:34Z) - Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis [12.160537328404622]
DRA-Ctrl provides new insights into reusing resource-intensive video models. DRA-Ctrl lays the foundation for future unified generative models across visual modalities.
arXiv Detail & Related papers (2025-05-29T10:34:45Z) - Programmatic Video Prediction Using Large Language Models [21.11346129620144]
ProgGen represents the dynamics of the video using a neuro-symbolic, human-interpretable set of states. The proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments.
arXiv Detail & Related papers (2025-05-20T22:17:47Z) - Vid2World: Crafting Video Diffusion Models to Interactive World Models [38.270098691244314]
Vid2World is a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. It performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. It introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model.
arXiv Detail & Related papers (2025-05-20T13:41:45Z) - AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM [1.7051307941715268]
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision.
Existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments.
This study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model.
arXiv Detail & Related papers (2025-03-06T14:52:34Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.
To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.
Experiments demonstrate that DWS can be flexibly applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z) - Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models.
We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z) - VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z) - WorldSimBench: Towards Video Generation Models as World Simulators [79.69709361730865]
We classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench.
WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks.
Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
arXiv Detail & Related papers (2024-10-23T17:56:11Z) - DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z) - VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z) - iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens; a sketch of this kind of token interleaving appears after the list below.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z) - Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
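Several of the entries above describe world models that flatten multimodal signals into a single token stream for an autoregressive transformer; iVideoGPT is the clearest example. The sketch below shows one plausible way such an observation/action/reward interleaving could be assembled. The tokenizer stub, vocabulary offsets, and sizes are illustrative assumptions, not the configuration used in any of the papers listed.

```python
# Minimal sketch of interleaving observation, action, and reward tokens into one
# flat sequence for an autoregressive world model. All constants are assumptions.
import numpy as np

OBS_TOKENS_PER_FRAME = 16   # assumed output size of a compressive visual tokenizer
ACTION_OFFSET = 1024        # assumed vocabulary offsets keep modalities disjoint
REWARD_OFFSET = 1088

def tokenize_observation(frame, rng):
    """Stand-in for a compressive tokenizer: maps a frame to discrete codes."""
    return rng.integers(0, ACTION_OFFSET, size=OBS_TOKENS_PER_FRAME)

def interleave(frames, actions, rewards, seed=0):
    """Build one flat token sequence: [obs_t, act_t, rew_t, obs_t+1, ...]."""
    rng = np.random.default_rng(seed)
    sequence = []
    for frame, action, reward in zip(frames, actions, rewards):
        sequence.extend(tokenize_observation(frame, rng))
        sequence.append(ACTION_OFFSET + int(action))      # discrete action id
        sequence.append(REWARD_OFFSET + int(reward > 0))  # binary reward token
    return np.asarray(sequence)

frames = np.zeros((5, 64, 64, 3))
actions = [0, 3, 3, 1, 2]
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
tokens = interleave(frames, actions, rewards)
print(tokens.shape)   # (5 * (16 + 1 + 1),) = (90,)
```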