Can Image-To-Video Models Simulate Pedestrian Dynamics?
- URL: http://arxiv.org/abs/2510.17731v1
- Date: Mon, 20 Oct 2025 16:44:40 GMT
- Title: Can Image-To-Video Models Simulate Pedestrian Dynamics?
- Authors: Aaron Appelle, Jerome P. Lynch
- Abstract summary: High-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable world-modeling capabilities. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes.
- Score: 1.2676356746752893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent high-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable inherent world-modeling capabilities by virtue of training on large scale video datasets. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes. Our framework conditions I2V models on keyframes extracted from pedestrian trajectory benchmarks, then evaluates their trajectory prediction performance using quantitative measures of pedestrian dynamics.
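The following is a minimal, illustrative sketch of how such a quantitative evaluation might be organized, assuming pedestrian trajectories have already been extracted from the generated videos and time-aligned with the benchmark ground truth. The function names, data layout, and the choice of the standard ADE/FDE displacement metrics are assumptions for illustration, not the paper's actual code.

```python
# Illustrative sketch only: comparing trajectories tracked from an I2V model's
# output against ground-truth benchmark tracks using displacement metrics.
# All names and the data layout below are hypothetical.
import numpy as np

def average_displacement_error(pred, gt):
    """Mean L2 distance over all timesteps; pred, gt: (T, N, 2) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def final_displacement_error(pred, gt):
    """Mean L2 distance at the final timestep only."""
    return float(np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean())

# Hypothetical usage: 12 future timesteps, 5 pedestrians, 2D positions.
T, N = 12, 5
gt = np.cumsum(np.random.randn(T, N, 2) * 0.1, axis=0)   # stand-in ground truth
pred = gt + np.random.randn(T, N, 2) * 0.05               # stand-in model output
print("ADE:", average_displacement_error(pred, gt))
print("FDE:", final_displacement_error(pred, gt))
```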
Related papers
- MAD: Motion Appearance Decoupling for efficient Driving World Models [94.40548866741791]
We propose an efficient adaptation framework that converts generalist video models into controllable driving world models. The key idea is to decouple motion learning from appearance synthesis. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors.
arXiv Detail & Related papers (2026-01-14T12:52:23Z) - Autoregressive Flow Matching for Motion Prediction [14.914156964274897]
Autoregressive flow matching (ARFM) is a new method for probabilistic modeling of sequential continuous data. We develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance.
arXiv Detail & Related papers (2025-12-27T19:35:45Z) - FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos [109.99404241220039]
We introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models. We fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance.
arXiv Detail & Related papers (2025-12-11T18:53:15Z) - Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories [1.2676356746752893]
We benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters.
arXiv Detail & Related papers (2025-10-23T04:06:58Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Predicting 3D representations for Dynamic Scenes [29.630985082164383]
We present a novel framework for dynamic radiance field prediction given monocular video streams. Our method goes a step further by generating explicit 3D representations of the dynamic scene. We find that our approach exhibits emergent capabilities for geometric and semantic learning.
arXiv Detail & Related papers (2025-01-28T01:31:15Z) - DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z) - VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z) - AVID: Adapting Video Diffusion Models to World Models [10.757223474031248]
We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model.
AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos.
We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
arXiv Detail & Related papers (2024-10-01T13:48:31Z) - Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens.
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.