Flow and Depth Assisted Video Prediction with Latent Transformer
- URL: http://arxiv.org/abs/2511.16484v1
- Date: Thu, 20 Nov 2025 15:54:33 GMT
- Title: Flow and Depth Assisted Video Prediction with Latent Transformer
- Authors: Eliyas Suleyman, Paul Henderson, Eksan Firkat, Nicolas Pugeault
- Abstract summary: We present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
- Score: 6.973908410173025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
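The evaluation above relies on Wasserstein distances over object masks, which compare where predicted mass ends up rather than per-pixel overlap, and therefore capture motion errors directly. The following is a minimal sketch of that idea, assuming binary NumPy masks; the helper name mask_wasserstein, the equal-size subsampling, and the assignment-based solver are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a mask-based Wasserstein metric: treat each binary
# object mask as a uniform distribution over its foreground pixel coordinates
# and compute an approximate 2-Wasserstein distance between the predicted and
# ground-truth masks. The subsampling strategy and helper name are
# illustrative, not taken from the paper.
import numpy as np
from scipy.optimize import linear_sum_assignment


def mask_wasserstein(pred_mask: np.ndarray, gt_mask: np.ndarray,
                     n_samples: int = 256, seed: int = 0) -> float:
    """Approximate W2 distance between two binary masks of shape (H, W)."""
    rng = np.random.default_rng(seed)
    pred_pts = np.argwhere(pred_mask > 0).astype(float)  # (Np, 2) pixel coords
    gt_pts = np.argwhere(gt_mask > 0).astype(float)      # (Ng, 2)
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return float("inf")  # one mask is empty; distance is undefined
    # Subsample equal-size point sets so optimal transport between uniform
    # empirical distributions reduces to a one-to-one assignment problem.
    pred_pts = pred_pts[rng.choice(len(pred_pts), n_samples, replace=True)]
    gt_pts = gt_pts[rng.choice(len(gt_pts), n_samples, replace=True)]
    # Pairwise squared Euclidean costs, then solve the assignment exactly.
    cost = ((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].mean()))
```

In practice such a distance would presumably be computed per object and per predicted frame, then aggregated, so that a mask that is well-shaped but displaced still incurs a penalty proportional to the motion error.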
Related papers
- Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z) - Autoregressive Flow Matching for Motion Prediction [14.914156964274897]
Autoregressive flow matching (ARFM) is a new method for probabilistic modeling of sequential continuous data. We develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance.
arXiv Detail & Related papers (2025-12-27T19:35:45Z) - Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers [3.951575888190684]
This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatio-temporal self-attention layouts. We introduce a simple yet effective transformer for autoregressive video prediction, utilizing continuous pixel-space representations over the prediction horizon.
arXiv Detail & Related papers (2025-10-23T17:58:45Z) - What Happens Next? Anticipating Future Motion by Generating Point Trajectories [76.16266402727643]
We consider the problem of forecasting motion from a single image, predicting how objects in the world are likely to move. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators.
arXiv Detail & Related papers (2025-09-25T21:03:56Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - OFMPNet: Deep End-to-End Model for Occupancy and Flow Prediction in Urban Environment [0.0]
We introduce an end-to-end neural network methodology designed to predict the future behaviors of all dynamic objects in the environment.
We propose a novel time-weighted motion flow loss, whose application has shown a substantial decrease in end-point error (a speculative sketch of such a loss is given after this list).
arXiv Detail & Related papers (2024-04-02T19:37:58Z) - A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a Self-supervised Transformer Model [0.0]
A self-supervised model is proposed that simultaneously predicts a sequence of future frames from video input using a spatio-temporal attention network.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z) - Conditioned Human Trajectory Prediction using Iterative Attention Blocks [70.36888514074022]
We present a simple yet effective trajectory prediction model aimed at predicting pedestrian positions in urban-like environments.
Our model is a neural-based architecture that can run several layers of attention blocks and transformers in an iterative sequential fashion.
We show that without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models.
arXiv Detail & Related papers (2022-06-29T07:49:48Z) - FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z) - Motion Segmentation using Frequency Domain Transformer Networks [29.998917158604694]
We propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately.
Our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
arXiv Detail & Related papers (2020-04-18T15:05:11Z)
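As flagged in the OFMPNet entry above, here is a speculative PyTorch sketch of a time-weighted motion flow loss: per-timestep end-point error (the Euclidean distance between predicted and ground-truth flow vectors) weighted by a schedule over the prediction horizon. The exponential decay schedule, the (B, T, 2, H, W) tensor layout, and the function name are assumptions, not the paper's formulation.

```python
# Speculative sketch of a time-weighted flow loss in the spirit of the
# OFMPNet entry. The weighting schedule and helper name are illustrative
# assumptions, not the paper's actual loss.
import torch


def time_weighted_flow_loss(pred_flow: torch.Tensor,
                            gt_flow: torch.Tensor,
                            gamma: float = 0.9) -> torch.Tensor:
    """End-point error weighted by prediction timestep.

    pred_flow, gt_flow: (B, T, 2, H, W) predicted / ground-truth flow fields.
    gamma: per-step decay; later timesteps get smaller weights here, though
           an increasing schedule is equally plausible.
    """
    # End-point error: Euclidean distance between flow vectors per pixel.
    epe = torch.linalg.norm(pred_flow - gt_flow, dim=2)  # (B, T, H, W)
    T = epe.shape[1]
    weights = gamma ** torch.arange(T, device=epe.device, dtype=epe.dtype)
    weights = weights / weights.sum()  # normalize to keep the loss scale stable
    # Average EPE per timestep, then take the weighted sum over time.
    return (epe.mean(dim=(0, 2, 3)) * weights).sum()
```

A decaying schedule emphasizes near-term accuracy, while an increasing one would prioritize long-horizon behavior; which direction helps is an empirical question that the paper's reported end-point-error gains do not settle here.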