Video models are zero-shot learners and reasoners
- URL: http://arxiv.org/abs/2509.20328v2
- Date: Mon, 29 Sep 2025 20:44:46 GMT
- Title: Video models are zero-shot learners and reasoners
- Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
- Abstract summary: Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
- Score: 33.694362486721865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
Related papers
- Thinker: A vision-language foundation model for embodied intelligence [9.661713829767605]
We propose Thinker, a large vision-language foundation model for embodied intelligence. We construct a large-scale dataset tailored for robotic perception and reasoning. We introduce a simple yet effective approach that substantially enhances the model's capacity for video comprehension.
arXiv Detail & Related papers (2026-01-29T02:52:08Z) - Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate the capabilities that emerge when world-model priors are transferred into Vision-Language Models. We name our best-performing variant Dynamic Vision Aligner (DyVA). We find DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z) - From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models [65.0487600936788]
Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality content. We argue that VDMs naturally acquire structured representations and an implicit understanding of the visual world. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input sequences.
arXiv Detail & Related papers (2025-06-08T20:52:34Z) - Language Model Guided Interpretable Video Action Reasoning [32.999621421295416]
We present a new framework, the Language-guided Interpretable Action Recognition framework (LaIAR).
LaIAR leverages knowledge from language models to enhance both the recognition capabilities and the interpretability of video models.
In essence, we redefine the problem of understanding video model decisions as a task of aligning video and language models.
arXiv Detail & Related papers (2024-04-02T02:31:13Z) - Self-supervised learning of video representations from a child's perspective [27.439294457852423]
Children learn powerful internal models of the world around them from a few years of egocentric visual experience.
Can such internal models be learned from a child's visual experience with highly generic learning algorithms or do they require strong inductive biases?
arXiv Detail & Related papers (2024-02-01T03:27:26Z) - A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text.
Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z) - GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task [47.1857510710807]
We present a new learning framework, dubbed GPT4Image, in which the knowledge of large pre-trained models is extracted to help CNNs and ViTs learn better representations. We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks.
arXiv Detail & Related papers (2023-06-01T14:02:45Z) - Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
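The generator/scorer loop described above can be illustrated with a minimal, deliberately abstract sketch. This is not the paper's implementation: the generator, the scorers, the averaging rule, and all numbers here are made up for illustration; in the paper the generators and scorers are large pre-trained models.

```python
def generator(center, spread, n=9):
    """Propose evenly spaced candidate solutions around the current best guess."""
    step = 2 * spread / (n - 1)
    return [center - spread + i * step for i in range(n)]

def make_scorer(target):
    """Each scorer prefers candidates close to its own notion of the goal."""
    return lambda x: -abs(x - target)

def consensus_search(scorers, rounds=20):
    """Closed-loop iterative consensus: propose, score by the ensemble,
    re-center on the consensus pick, and narrow the search."""
    center, spread = 0.0, 10.0
    for _ in range(rounds):
        candidates = generator(center, spread)
        # Consensus feedback: average score across the scorer ensemble.
        center = max(candidates, key=lambda c: sum(s(c) for s in scorers) / len(scorers))
        spread *= 0.7  # refine the search around the consensus pick
    return center

scorers = [make_scorer(t) for t in (4.9, 5.0, 5.1)]  # ensemble agrees near 5.0
print(consensus_search(scorers))  # → 5.0
```

The toy illustrates the claimed benefit: no single scorer's target is trusted outright; the loop converges to the point the ensemble jointly prefers.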
arXiv Detail & Related papers (2022-10-20T18:46:31Z) - What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z) - Unsupervised Object Learning via Common Fate [61.14802390241075]
Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling.
We decompose this problem into three easier subtasks, and provide candidate solutions for each of them.
We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos.
arXiv Detail & Related papers (2021-10-13T08:22:04Z) - Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions [30.88621433812347]
This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text-corpora and transfers the knowledge to video.
Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language.
arXiv Detail & Related papers (2021-06-06T15:43:39Z) - CAZSL: Zero-Shot Regression for Pushing Models by Generalizing Through Context [13.217582954907234]
We study the problem of designing deep learning agents which can generalize their models of the physical world by building context-aware models.
We present context-aware zero-shot learning (CAZSL, pronounced as casual) models, an approach utilizing a Siamese network, an embedding space, and regularization based on context variables.
We test our proposed learning algorithm on the recently released Omnipush dataset, which allows testing of meta-learning capabilities.
arXiv Detail & Related papers (2020-03-26T01:21:58Z)
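The general ingredients named in the CAZSL abstract can be sketched generically: a shared ("Siamese") encoder applied to two states, conditioned on context variables, with the outcome regressed from the embedding difference. This is a hypothetical illustration only; the paper's actual architecture, layer sizes, context encoding, and regularizer are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
D_IN, D_CTX, D_EMB = 6, 3, 16

# Weight sharing is what makes the encoder "Siamese":
# both states pass through the same parameters.
W_enc = rng.normal(scale=0.1, size=(D_IN + D_CTX, D_EMB))
W_out = rng.normal(scale=0.1, size=(D_EMB,))

def encode(x, context):
    """Embed a state conditioned on its context variables (e.g. object properties)."""
    return np.tanh(np.concatenate([x, context]) @ W_enc)

def predict_push(x_before, x_after, context):
    """Regress a pushing outcome from the difference of the two
    context-conditioned embeddings."""
    z = encode(x_before, context) - encode(x_after, context)
    return float(z @ W_out)

ctx = np.array([1.0, 0.2, 0.5])  # made-up context: e.g. mass, friction, shape code
x0, x1 = rng.normal(size=D_IN), rng.normal(size=D_IN)
print(predict_push(x0, x1, ctx))
```

Because the context enters the shared encoder rather than a per-object head, the same trained weights can, in principle, be queried with unseen context values — the zero-shot generalization the abstract refers to.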
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.