Generalist Forecasting with Frozen Video Models via Latent Diffusion
- URL: http://arxiv.org/abs/2507.13942v1
- Date: Fri, 18 Jul 2025 14:14:19 GMT
- Title: Generalist Forecasting with Frozen Video Models via Latent Diffusion
- Authors: Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, Shiry Ginosar,
- Abstract summary: We show a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons.<n>Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
- Score: 35.96406989431198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models-including those trained generatively-and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
Related papers
- Multitask Learning with Stochastic Interpolants [13.301909784310894]
We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models.<n>We generalize interpolants by replacing the scalar time variable with vectors, matrices, or linear operators.<n>This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training.
arXiv Detail & Related papers (2025-08-06T16:25:19Z) - Consistent World Models via Foresight Diffusion [56.45012929930605]
We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability.<n>We propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising.
arXiv Detail & Related papers (2025-05-22T10:01:59Z) - Toward a Diffusion-Based Generalist for Dense Vision Tasks [141.03236279493686]
Recent works have shown image itself can be used as a natural interface for general-purpose visual perception.
We propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks.
In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.
arXiv Detail & Related papers (2024-06-29T17:57:22Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Predict, Refine, Synthesize: Self-Guiding Diffusion Models for
Probabilistic Time Series Forecasting [10.491628898499684]
We propose TSDiff, an unconditionally-trained diffusion model for time series.
Our proposed self-guidance mechanism enables conditioning TSDiff for downstream tasks during inference, without requiring auxiliary networks or altering the training procedure.
We demonstrate the effectiveness of our method on three different time series tasks: forecasting, refinement, and synthetic data generation.
arXiv Detail & Related papers (2023-07-21T10:56:36Z) - Unified Recurrence Modeling for Video Action Anticipation [16.240254363118016]
We propose a unified recurrence modeling for video action anticipation via message passing framework.
Our proposed method outperforms previous works on the large-scale EPIC-Kitchen dataset.
arXiv Detail & Related papers (2022-06-02T12:16:44Z) - Learning Long-term Visual Dynamics with Region Proposal Interaction
Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long-range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.