Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
- URL: http://arxiv.org/abs/2510.20182v1
- Date: Thu, 23 Oct 2025 04:06:58 GMT
- Title: Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
- Authors: Aaron Appelle, Jerome P. Lynch
- Abstract summary: We benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters.
- Score: 1.2676356746752893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people, so the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground-truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye-view trajectories from pixel space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
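To make the trajectory-reconstruction step concrete, here is a minimal sketch of one classical route, assuming a few ground-plane correspondences can be identified in the scene; the paper's actual calibration-free method is not reproduced here. A planar homography requires no camera intrinsics, which is consistent with operating without known camera parameters.

```python
# Hypothetical sketch: map pixel-space pedestrian tracks to bird's-eye-view
# (BEV) coordinates via a ground-plane homography. Assumes four image points
# with known metric ground-plane positions are available.
import numpy as np
import cv2

def pixel_tracks_to_bev(tracks_px, img_pts, world_pts):
    """Map per-person pixel trajectories to 2D bird's-eye-view coordinates.

    tracks_px: dict person_id -> (T, 2) array of image (x, y) foot positions
    img_pts:   (4, 2) pixel coordinates of known ground-plane points
    world_pts: (4, 2) corresponding metric ground-plane coordinates
    """
    # Homography from the image plane to the ground plane; no intrinsics needed.
    H, _ = cv2.findHomography(np.float32(img_pts), np.float32(world_pts))
    bev_tracks = {}
    for pid, track in tracks_px.items():
        pts = np.float32(track).reshape(-1, 1, 2)   # OpenCV expects (T, 1, 2)
        bev_tracks[pid] = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    return bev_tracks
```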
Related papers
- Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning [66.51617619673587]
We present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), the first large-scale AI-generated video dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy.
arXiv Detail & Related papers (2025-12-17T18:48:26Z)
- DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos [24.681248200255975]
Video models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. We present DRAW2ACT, a trajectory-conditioned video generation framework that extracts multiple representations from the input trajectory. We show that DRAW2ACT achieves superior visual fidelity and consistency while yielding higher manipulation success rates compared to existing baselines.
arXiv Detail & Related papers (2025-12-16T09:11:36Z) - Can Image-To-Video Models Simulate Pedestrian Dynamics? [1.2676356746752893]
High-performing image-to-video (I2V) models based on variants of the diffusion transformer (DiT) have displayed remarkable world-modeling capabilities. We investigate whether these models can generate realistic pedestrian movement patterns in crowded public scenes.
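One way to quantify "realistic pedestrian movement patterns" is to compare the distribution of walking speeds in generated versus ground-truth tracks. The sketch below uses the Wasserstein distance as an illustrative stand-in, not necessarily the metrics reported in the paper.

```python
# Hypothetical plausibility check: compare walking-speed distributions of
# generated vs. ground-truth BEV tracks (each track is a (T, 2) array in
# metres). Smaller distance = generated motion statistics look more real.
import numpy as np
from scipy.stats import wasserstein_distance

def speeds(tracks, fps=25.0):
    """Per-step speeds (m/s) pooled over all tracks."""
    return np.concatenate(
        [np.linalg.norm(np.diff(t, axis=0), axis=1) * fps for t in tracks]
    )

def speed_realism_gap(gen_tracks, real_tracks, fps=25.0):
    return wasserstein_distance(speeds(gen_tracks, fps), speeds(real_tracks, fps))
```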
arXiv Detail & Related papers (2025-10-20T16:44:40Z) - AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes [63.055387623861094]
Recent text-to-video (T2V) models have demonstrated powerful capability in visual simulation of real-world geometry and physical laws. We propose a two-stage paradigm to adapt pre-trained T2V models for viewpoint prediction.
arXiv Detail & Related papers (2025-10-12T15:55:44Z) - DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis [17.750053029702222]
Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. We introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion. For background scene consistency, we obtain interpretable error maps based on the VBench motion smoothness metric. Our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level.
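As a rough illustration of a per-pixel smoothness error map (not VBench's actual metric, which relies on a learned frame-interpolation model), the second temporal difference of the frames flags intensities that change non-smoothly over time:

```python
# Simple stand-in for a motion-smoothness error map: the second temporal
# difference is zero wherever pixel intensity varies linearly over time,
# and large where motion is jerky or discontinuous.
import numpy as np

def smoothness_error_maps(frames):
    """frames: (T, H, W) grayscale video in [0, 1]; returns (T-2, H, W) maps."""
    f = np.asarray(frames, dtype=np.float32)
    return np.abs(f[2:] - 2.0 * f[1:-1] + f[:-2])
```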
arXiv Detail & Related papers (2025-10-08T18:41:04Z) - DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are …
arXiv Detail & Related papers (2025-08-04T18:27:55Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained spatio-temporal pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - Direct Motion Models for Assessing Generated Videos [38.04485796547767]
A current limitation of generative video models is that they produce plausible-looking frames but poor motion. Here we go beyond FVD by developing a metric that better measures plausible object interactions and motion. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric that is markedly more sensitive to temporal distortions in synthetic data.
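A hedged sketch of the point-track idea, assuming (N, T, 2) pixel tracks have already been extracted from each video with an off-the-shelf point tracker (e.g. CoTracker): compare acceleration statistics of real and generated videos, which jitter and teleporting points inflate.

```python
# Sketch of a track-based motion metric; not the paper's exact formulation.
import numpy as np
from scipy.stats import wasserstein_distance

def accel_magnitudes(tracks):
    """Pooled per-point acceleration magnitudes from (N, T, 2) tracks."""
    acc = np.diff(tracks, n=2, axis=1)        # second temporal difference
    return np.linalg.norm(acc, axis=-1).ravel()

def track_motion_distance(gen_tracks, real_tracks):
    # Temporal distortions in generated videos show up as a heavier tail
    # of large accelerations, widening this distributional distance.
    return wasserstein_distance(accel_magnitudes(gen_tracks),
                                accel_magnitudes(real_tracks))
```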
arXiv Detail & Related papers (2025-04-30T22:34:52Z) - Can Generative Video Models Help Pose Estimation? [42.10672365565019]
Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose. We use a video model to hallucinate intermediate frames between two input images, effectively creating a dense visual transition.
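The chaining step can be made concrete in a few lines: hallucinated intermediate frames turn one wide-baseline pair into a sequence of easy adjacent pairs whose relative SE(3) poses compose into the full transform. The `estimate_relative_pose` callable below is a hypothetical pairwise estimator, not the paper's specific model.

```python
# Compose relative poses along a hallucinated frame sequence. Convention:
# estimate_relative_pose(a, b) returns the 4x4 pose of frame b in frame a,
# so left-to-right composition yields the last frame's pose in the first.
import numpy as np

def chain_poses(frames, estimate_relative_pose):
    T_total = np.eye(4)
    for a, b in zip(frames[:-1], frames[1:]):
        T_total = T_total @ estimate_relative_pose(a, b)
    return T_total
```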
arXiv Detail & Related papers (2024-12-20T18:58:24Z) - Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models. We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
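One generic way binary feedback can drive such improvements (a sketch of reward-filtered fine-tuning, not necessarily the paper's algorithm) is to keep only the generations a judge model accepts and fine-tune on them; every name below is a hypothetical placeholder.

```python
# Rejection-sampling-style fine-tuning loop driven by binary AI feedback.
def feedback_filter_finetune(model, prompts, vlm_judge, finetune_step):
    kept = []
    for prompt in prompts:
        video = model.generate_video(prompt)       # hypothetical sampler
        if vlm_judge(prompt, video) == 1:          # binary AI feedback
            kept.append((prompt, video))
    for prompt, video in kept:                     # train only on accepted samples
        finetune_step(model, prompt, video)
    return model
```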
arXiv Detail & Related papers (2024-12-03T17:44:23Z) - InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels. Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos with text. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
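A minimal sketch of the setting, assuming hypothetical CLIP-style encoders have already produced the embeddings: fuse each video's channel embeddings (frames plus transcript) and rank candidates against a text query by cosine similarity.

```python
# Multi-channel retrieval with a frozen contrastive model: fuse channels by
# weighted averaging, then score by cosine similarity on unit-norm vectors.
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fuse_channels(frame_embs, transcript_emb, alpha=0.5):
    """frame_embs: (F, D) per-frame embeddings; transcript_emb: (D,)."""
    video_emb = l2_normalize(frame_embs.mean(axis=0))   # pool the visual channel
    return l2_normalize(alpha * video_emb + (1 - alpha) * transcript_emb)

def rank_candidates(query_text_emb, fused_video_embs):
    """fused_video_embs: (V, D) one fused embedding per candidate video."""
    sims = fused_video_embs @ l2_normalize(query_text_emb)
    return np.argsort(-sims)                            # best match first
```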
arXiv Detail & Related papers (2022-06-05T01:43:52Z)