From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation
- URL: http://arxiv.org/abs/2510.00806v1
- Date: Wed, 01 Oct 2025 12:11:36 GMT
- Title: From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation
- Authors: Fan Yang, Zhiyang Chen, Yousong Zhu, Xin Li, Jinqiao Wang
- Abstract summary: TrajVLM-Gen is a framework for physics-aware image-to-video generation. We employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns.
- Score: 33.41681612310823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.
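The paper does not include code, but the two-stage idea can be sketched concretely: a Stage-1 trajectory predictor emits coarse (x, y) waypoints per frame, and Stage 2 turns those waypoints into per-frame attention-bias maps that a video generator's attention layers could consume. Everything below is illustrative; the VLM is replaced by a physics-shaped stub, and the grid size, sigma, and function names are assumptions, not the authors' API.

```python
# Minimal sketch of the two-stage TrajVLM-Gen idea (illustrative names only).
import numpy as np

def predict_trajectory_stub(num_frames: int) -> np.ndarray:
    """Stand-in for Stage 1 (VLM trajectory prediction): a ball falling
    under gravity, returned as normalized (x, y) waypoints per frame."""
    t = np.linspace(0.0, 1.0, num_frames)
    x = 0.2 + 0.6 * t            # constant horizontal velocity
    y = 0.1 + 0.8 * t ** 2       # gravity-like quadratic drop
    return np.stack([x, y], axis=1)

def trajectory_to_attention_bias(traj: np.ndarray, size: int = 16,
                                 sigma: float = 1.5) -> np.ndarray:
    """Stage 2 guidance: rasterize each waypoint into a Gaussian bump on the
    attention grid, biasing generation toward the predicted location."""
    ys, xs = np.mgrid[0:size, 0:size]
    frames = []
    for x, y in traj:
        cx, cy = x * (size - 1), y * (size - 1)
        bump = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        frames.append(bump / bump.max())
    return np.stack(frames)      # (T, size, size), added to attention logits

bias = trajectory_to_attention_bias(predict_trajectory_stub(num_frames=8))
print(bias.shape)                # (8, 16, 16)
```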
Related papers
- Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement [51.54051161067026]
We propose an iterative self-refinement framework to provide physics-aware guidance for video generation. We introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback about physical inconsistencies. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38.
arXiv Detail & Related papers (2025-11-25T13:09:03Z) - PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z) - Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation [80.89133198952187]
- Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation [80.89133198952187]
PhysHPO is a novel framework for Hierarchical Cross-Modal Direct Preference Optimization. It enables fine-grained preference alignment for physically plausible video generation. We show that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models.
arXiv Detail & Related papers (2025-08-14T17:30:37Z) - Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation [54.42523027597904]
- Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation [54.42523027597904]
We introduce a novel framework that integrates symbolic regression and trajectory-guided image-to-video (I2V) models for physics-grounded video forecasting. Our approach extracts motion trajectories from input videos, uses a retrieval-based pre-training mechanism to enhance symbolic regression, and discovers equations of motion to forecast physically accurate future trajectories.
arXiv Detail & Related papers (2025-07-09T13:28:42Z) - SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z) - Programmatic Video Prediction Using Large Language Models [21.11346129620144]
- Programmatic Video Prediction Using Large Language Models [21.11346129620144]
ProgGen represents the dynamics of the video using a set of neuro-symbolic, human-interpretable states. Our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments.
arXiv Detail & Related papers (2025-05-20T22:17:47Z) - VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
- VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos. However, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics. We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics through a vision- and language-informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z) - VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators and achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.