VFMF: World Modeling by Forecasting Vision Foundation Model Features
- URL: http://arxiv.org/abs/2512.11225v1
- Date: Fri, 12 Dec 2025 02:10:05 GMT
- Title: VFMF: World Modeling by Forecasting Vision Foundation Model Features
- Authors: Gabrijel Boduljak, Yushi Lan, Christian Rupprecht, Andrea Vedaldi
- Abstract summary: We introduce a generative forecaster that performs autoregressive flow matching in vision foundation model (VFM) feature space. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities.
- Score: 67.09340259579761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.
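The abstract describes training a generative forecaster with conditional flow matching over encoded VFM features: a velocity field is regressed along a noise-to-feature interpolation path, conditioned on the features of past frames. The paper does not specify the network or latent dimensions, so the sketch below is a hypothetical, minimal NumPy illustration of the flow-matching objective only (`velocity_field`, `W`, and the feature dimension `d` are stand-ins, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical dimension of the compact VFM latent space


def velocity_field(x_tau, tau, z_prev, W):
    """Stand-in for the learned velocity network v_theta:
    a single linear map over [x_tau, tau, z_prev]."""
    inp = np.concatenate([x_tau, [tau], z_prev])
    return W @ inp


def flow_matching_loss(z_prev, z_next, W, rng):
    """One conditional flow-matching sample: linearly interpolate
    noise -> next-frame feature, regress the constant target velocity."""
    eps = rng.standard_normal(d)               # noise endpoint
    tau = rng.uniform()                        # random time in [0, 1]
    x_tau = (1.0 - tau) * eps + tau * z_next   # point on the probe path
    v_target = z_next - eps                    # target velocity along the path
    v_pred = velocity_field(x_tau, tau, z_prev, W)
    return float(np.mean((v_pred - v_target) ** 2))


z_prev = rng.standard_normal(d)   # encoded features of the current frame
z_next = rng.standard_normal(d)   # encoded features of the next frame
W = rng.standard_normal((d, 2 * d + 1)) * 0.01
loss = flow_matching_loss(z_prev, z_next, W, rng)
```

At inference, one would instead integrate the learned velocity field from noise toward a predicted next-frame feature, conditioned autoregressively on previously generated features.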
Related papers
- Future Optical Flow Prediction Improves Robot Control & Video Generation [100.87884718953099]
We introduce FOFPred, a novel optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred.
arXiv Detail & Related papers (2026-01-15T18:49:48Z)
- Towards Agnostic and Holistic Universal Image Segmentation with Bit Diffusion [9.184659875364689]
This paper introduces a diffusion-based framework for universal image segmentation. We show that a location-aware palette with our 2D gray code ordering improves performance. We believe that combining our proposed improvements with large-scale pretraining or promptable conditioning could lead to competitive models.
arXiv Detail & Related papers (2026-01-06T10:07:14Z)
- Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z)
- Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model [32.831576387973875]
We propose a two-stage deterministic framework for stable, accurate, and fine-grained geometric dense prediction. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor.
arXiv Detail & Related papers (2025-11-30T18:57:25Z)
- Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the calibrated capacity to imagine and choose. We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
- A Time-Series Foundation Model by Universal Delay Embedding [4.221753069966852]
This study introduces Universal Delay Embedding (UDE), a pretrained foundation model designed to revolutionize time-series forecasting. UDE, as a dynamical representation of observed data, constructs two-dimensional subspace patches from Hankel matrices. In particular, the learned dynamical representations and the Koopman operator predictions formed from the patches exhibit exceptional interpretability.
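The Hankel-matrix construction mentioned above is a standard delay embedding: overlapping windows of a series become the columns of a matrix. The function name and toy sizes below are illustrative only, not UDE's actual patching pipeline:

```python
import numpy as np


def hankel_patches(series, window):
    """Delay-embed a 1-D series into a Hankel matrix whose columns
    are consecutive overlapping windows of length `window`."""
    n = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n)], axis=1)


x = np.arange(6.0)        # toy series: 0, 1, ..., 5
H = hankel_patches(x, 3)  # 3 x 4 matrix; column j is x[j : j + 3]
```

Each anti-diagonal of `H` is constant, which is the defining Hankel property; 2D patches of such matrices are what UDE reportedly feeds to its encoder.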
arXiv Detail & Related papers (2025-09-15T16:11:49Z)
- LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation [45.02469804709771]
We propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. We show that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% in the real-world scenario.
arXiv Detail & Related papers (2025-05-13T04:42:14Z)
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z)
- PDC-Net+: Enhanced Probabilistic Dense Correspondence Network [161.76275845530964]
We present the Enhanced Probabilistic Dense Correspondence Network, PDC-Net+, capable of estimating accurate dense correspondences.
We develop an architecture and an enhanced training strategy tailored for robust and generalizable uncertainty prediction.
Our approach obtains state-of-the-art results on multiple challenging geometric matching and optical flow datasets.
arXiv Detail & Related papers (2021-09-28T17:56:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.