Programmatic Video Prediction Using Large Language Models
- URL: http://arxiv.org/abs/2505.14948v1
- Date: Tue, 20 May 2025 22:17:47 GMT
- Title: Programmatic Video Prediction Using Large Language Models
- Authors: Hao Tang, Kevin Ellis, Suhas Lohit, Michael J. Jones, Moitreya Chatterjee
- Abstract summary: ProgGen represents the dynamics of the video using a set of neuro-symbolic, human-interpretable states. The proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments.
- Score: 21.11346129620144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of estimating the world model describing the dynamics of a real-world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video as a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLMs/VLMs). In particular, ProgGen utilizes LLMs/VLMs to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e., the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.
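The abstract describes a three-program pipeline: perception (frames to interpretable states), dynamics (past states to future states), and rendering (states back to RGB frames), with each program written by an LLM/VLM. The following is a minimal Python sketch of that control flow, under stated assumptions: the helper names, prompts, and state format are illustrative and are not taken from the paper's implementation.

```python
from typing import Callable, Dict, List

# One interpretable state per frame, e.g. {"cart_x": 0.1, "pole_angle": 0.05}
State = Dict[str, float]

def compile_program(source: str) -> Callable:
    """Hypothetical helper: load LLM-written source that defines `def program(...)`."""
    namespace: dict = {}
    exec(source, namespace)  # in practice the synthesized code would need sandboxing
    return namespace["program"]

def proggen_predict(llm: Callable[[str], str],
                    vlm: Callable[[str], str],
                    context_frames: List,
                    horizon: int) -> List:
    # (i) Perception program: context frames -> human-interpretable states.
    perceive = compile_program(
        vlm("Write `def program(frame)` returning the object states in the frame."))
    states: List[State] = [perceive(f) for f in context_frames]

    # (ii) Dynamics program: past states -> next state, rolled out `horizon` steps.
    step = compile_program(
        llm("Write `def program(states)` predicting the next state from past states."))
    future: List[State] = []
    for _ in range(horizon):
        future.append(step(states + future))

    # (iii) Rendering program: predicted state -> RGB frame.
    render = compile_program(
        llm("Write `def program(state)` rendering the state as an RGB image."))
    return [render(s) for s in future]
```

Because the intermediate states are explicit, interpretable objects, a counterfactual rollout can be produced by editing a predicted state before rendering, which matches the counterfactual-reasoning capability the abstract highlights.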
Related papers
- Light Future: Multimodal Action Frame Prediction via InstructPix2Pix [0.0]
This paper proposes a novel, efficient, and lightweight approach for robot action prediction. It offers significantly reduced computational cost and inference latency compared to conventional video prediction models. It pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks.
arXiv Detail & Related papers (2025-07-20T03:57:18Z)
- VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior [88.51778468222766]
Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos. However, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics. We propose a novel two-stage image-to-video generation framework that explicitly incorporates physics through a vision- and language-informed physical prior.
arXiv Detail & Related papers (2025-03-30T09:03:09Z)
- Object-Centric Image to Video Generation with Language Guidance [17.50161162624179]
TextOCVP is an object-centric model for image-to-video generation guided by textual descriptions. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions.
arXiv Detail & Related papers (2025-02-17T10:46:47Z)
- EVA: An Embodied World Model for Future Video Anticipation [42.937348053592636]
We decompose complex video prediction into four meta-tasks that enable the world model to handle the problem in a more fine-grained manner.
We introduce a new benchmark named Embodied Video Anticipation Benchmark (EVA-Bench) to provide a well-rounded evaluation.
We propose the Embodied Video Anticipator (EVA), a unified framework aiming at video understanding and generation.
arXiv Detail & Related papers (2024-10-20T18:24:00Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
- Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions [27.112210225969733]
We propose a novel framework for the task of object-centric video prediction, i.e., extracting the structure of a video sequence, as well as modeling object dynamics and interactions from visual observations.
With the goal of learning meaningful object representations, we propose two object-centric video predictor (OCVP) transformer modules, which decouple the processing of temporal dynamics and object interactions.
In our experiments, we show how our object-centric prediction framework utilizing our OCVP predictors outperforms object-agnostic video prediction models on two different datasets.
arXiv Detail & Related papers (2023-02-23T08:29:26Z)
- Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z)
- Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the loss of global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
- Learning Semantic-Aware Dynamics for Video Prediction [68.04359321855702]
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
arXiv Detail & Related papers (2021-04-20T05:00:24Z)
- Mutual Information Based Method for Unsupervised Disentanglement of Video Representation [0.0]
Video prediction models have found prospective applications in maneuver planning, healthcare, autonomous navigation, and simulation.
One of the major challenges in future frame generation is the high-dimensional nature of visual data.
We propose the Mutual Information Predictive Auto-Encoder framework, which reduces the difficulty of predicting high-dimensional video frames by disentangling the video representation.
arXiv Detail & Related papers (2020-11-17T13:16:07Z)