Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
- URL: http://arxiv.org/abs/2510.20182v1
- Date: Thu, 23 Oct 2025 04:06:58 GMT
- Title: Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
- Authors: Aaron Appelle, Jerome P. Lynch
- Abstract summary: We benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters.
- Score: 1.2676356746752893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people, so the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground-truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
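To make the trajectory-reconstruction step concrete, here is a minimal sketch of one classical route, assuming a few ground-plane correspondences can be identified in the scene; the paper's actual calibration-free method is not reproduced here. A planar homography requires no camera intrinsics, which is consistent with operating without known camera parameters.

```python
# Hypothetical sketch: map pixel-space pedestrian tracks to bird's-eye-view
# (BEV) coordinates via a ground-plane homography. Assumes four image points
# with known metric ground-plane positions are available.
import numpy as np
import cv2

def pixel_tracks_to_bev(tracks_px, img_pts, world_pts):
    """Map per-person pixel trajectories to 2D bird's-eye-view coordinates.

    tracks_px: dict person_id -> (T, 2) array of image (x, y) foot positions
    img_pts:   (4, 2) pixel coordinates of known ground-plane points
    world_pts: (4, 2) corresponding metric ground-plane coordinates
    """
    # Homography from the image plane to the ground plane; no intrinsics needed.
    H, _ = cv2.findHomography(np.float32(img_pts), np.float32(world_pts))
    bev_tracks = {}
    for pid, track in tracks_px.items():
        pts = np.float32(track).reshape(-1, 1, 2)   # OpenCV expects (T, 1, 2)
        bev_tracks[pid] = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    return bev_tracks
```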
Related papers
- Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning [66.51617619673587]
We present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), the first large-scale AI-generated video dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy.
arXiv Detail & Related papers (2025-12-17T18:48:26Z)
- DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos [24.681248200255975]
Video models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. We present DRAW2ACT, a trajectory-conditioned video generation framework that extracts multiple representations from the input trajectory. We show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
arXiv Detail & Related papers (2025-12-16T09:11:36Z) - Can Image-To-Video Models Simulate Pedestrian Dynamics? [1.2676356746752893]
High-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable world-modeling capabilities. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes.
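One way to quantify "realistic pedestrian movement patterns" is to compare the distribution of walking speeds in generated versus ground-truth tracks. The sketch below uses the Wasserstein distance as an illustrative stand-in, not necessarily the metrics reported in the paper.

```python
# Hypothetical plausibility check: compare walking-speed distributions of
# generated vs. ground-truth BEV tracks (each track is a (T, 2) array in
# metres). Smaller distance = generated motion statistics look more real.
import numpy as np
from scipy.stats import wasserstein_distance

def speeds(tracks, fps=25.0):
    """Per-step speeds (m/s) pooled over all tracks."""
    return np.concatenate(
        [np.linalg.norm(np.diff(t, axis=0), axis=1) * fps for t in tracks]
    )

def speed_realism_gap(gen_tracks, real_tracks, fps=25.0):
    return wasserstein_distance(speeds(gen_tracks, fps), speeds(real_tracks, fps))
```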
arXiv Detail & Related papers (2025-10-20T16:44:40Z) - AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes [63.055387623861094]
Recent text-to-video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws. We propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction.
arXiv Detail & Related papers (2025-10-12T15:55:44Z) - DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis [17.750053029702222]
Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. We introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. Our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level.
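As a rough illustration of a per-pixel smoothness error map (not VBench's actual metric, which relies on a learned frame-interpolation model), the second temporal difference of the frames flags intensities that change non-smoothly over time:

```python
# Simple stand-in for a motion-smoothness error map: the second temporal
# difference is zero wherever pixel intensity varies linearly over time,
# and large where motion is jerky or discontinuous.
import numpy as np

def smoothness_error_maps(frames):
    """frames: (T, H, W) grayscale video in [0, 1]; returns (T-2, H, W) maps."""
    f = np.asarray(frames, dtype=np.float32)
    return np.abs(f[2:] - 2.0 * f[1:-1] + f[:-2])
```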
arXiv Detail & Related papers (2025-10-08T18:41:04Z) - DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are …
arXiv Detail & Related papers (2025-08-04T18:27:55Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained spatio-temporal pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - Direct Motion Models for Assessing Generated Videos [38.04485796547767]
A current limitation of generative video models is that they produce plausible-looking frames but poor motion. Here we go beyond FVD by developing a metric that better measures plausible object interactions and motion. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric that is markedly more sensitive to temporal distortions in synthetic data.
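A hedged sketch of the point-track idea, assuming (N, T, 2) pixel tracks have already been extracted from each video with an off-the-shelf point tracker (e.g. CoTracker): compare acceleration statistics of real and generated videos, which jitter and teleporting points inflate.

```python
# Sketch of a track-based motion metric; not the paper's exact formulation.
import numpy as np
from scipy.stats import wasserstein_distance

def accel_magnitudes(tracks):
    """Pooled per-point acceleration magnitudes from (N, T, 2) tracks."""
    acc = np.diff(tracks, n=2, axis=1)        # second temporal difference
    return np.linalg.norm(acc, axis=-1).ravel()

def track_motion_distance(gen_tracks, real_tracks):
    # Temporal distortions in generated videos show up as a heavier tail
    # of large accelerations, widening this distributional distance.
    return wasserstein_distance(accel_magnitudes(gen_tracks),
                                accel_magnitudes(real_tracks))
```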
arXiv Detail & Related papers (2025-04-30T22:34:52Z) - Can Generative Video Models Help Pose Estimation? [42.10672365565019]
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose. We use a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition.
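The chaining step can be made concrete in a few lines: hallucinated intermediate frames turn one wide-baseline pair into a sequence of easy adjacent pairs whose relative SE(3) poses compose into the full transform. The `estimate_relative_pose` callable below is a hypothetical pairwise estimator, not the paper's specific model.

```python
# Compose relative poses along a hallucinated frame sequence. Convention:
# estimate_relative_pose(a, b) returns the 4x4 pose of frame b in frame a,
# so left-to-right composition yields the last frame's pose in the first.
import numpy as np

def chain_poses(frames, estimate_relative_pose):
    T_total = np.eye(4)
    for a, b in zip(frames[:-1], frames[1:]):
        T_total = T_total @ estimate_relative_pose(a, b)
    return T_total
```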
arXiv Detail & Related papers (2024-12-20T18:58:24Z) - Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models. We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
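One generic way binary feedback can drive such improvements (a sketch of reward-filtered fine-tuning, not necessarily the paper's algorithm) is to keep only the generations a judge model accepts and fine-tune on them; every name below is a hypothetical placeholder.

```python
# Rejection-sampling-style fine-tuning loop driven by binary AI feedback.
def feedback_filter_finetune(model, prompts, vlm_judge, finetune_step):
    kept = []
    for prompt in prompts:
        video = model.generate_video(prompt)       # hypothetical sampler
        if vlm_judge(prompt, video) == 1:          # binary AI feedback
            kept.append((prompt, video))
    for prompt, video in kept:                     # train only on accepted samples
        finetune_step(model, prompt, video)
    return model
```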
arXiv Detail & Related papers (2024-12-03T17:44:23Z) - InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels. Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos with text. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
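A minimal sketch of the setting, assuming hypothetical CLIP-style encoders have already produced the embeddings: fuse each video's channel embeddings (frames plus transcript) and rank candidates against a text query by cosine similarity.

```python
# Multi-channel retrieval with a frozen contrastive model: fuse channels by
# weighted averaging, then score by cosine similarity on unit-norm vectors.
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse_channels(frame_embs, transcript_emb, alpha=0.5):
    """frame_embs: (F, D) per-frame embeddings; transcript_emb: (D,)."""
    video_emb = l2_normalize(frame_embs.mean(axis=0))   # pool the visual channel
    return l2_normalize(alpha * video_emb + (1 - alpha) * transcript_emb)

def rank_candidates(query_text_emb, fused_video_embs):
    """fused_video_embs: (V, D) one fused embedding per candidate video."""
    sims = fused_video_embs @ l2_normalize(query_text_emb)
    return np.argsort(-sims)                            # best match first
```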
arXiv Detail & Related papers (2022-06-05T01:43:52Z)