Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions
- URL: http://arxiv.org/abs/2106.03158v1
- Date: Sun, 6 Jun 2021 15:43:39 GMT
- Title: Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions
- Authors: Fadime Sener, Rishabh Saraf, Angela Yao
- Abstract summary: This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video.
Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language.
- Score: 30.88621433812347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
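The abstract describes the approach only at a high level: learn step-to-step structure from text recipes at scale, then use that knowledge to anticipate future steps of an instructional video. As a rough illustration of that idea (not the authors' implementation), the sketch below trains a two-level model on text alone, where a step encoder embeds each instruction and a recipe-level RNN predicts the embedding of the next step; the module names, GRU sizes, and the simplification of predicting embeddings instead of decoding language are all assumptions.

```python
# Hypothetical sketch: next-step anticipation learned from text recipes only.
# (Illustrative; not the paper's actual architecture or training code.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepEncoder(nn.Module):
    """Encodes one instruction step (a sequence of word ids) into a vector."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, word_ids):                # (batch, words)
        _, h = self.gru(self.embed(word_ids))   # h: (1, batch, dim)
        return h.squeeze(0)                     # (batch, dim)

class RecipeModel(nn.Module):
    """Recipe-level GRU that predicts the embedding of the next step."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.step_enc = StepEncoder(vocab_size, dim)
        self.recipe_rnn = nn.GRU(dim, dim, batch_first=True)
        self.predict = nn.Linear(dim, dim)

    def forward(self, steps):                   # (batch, n_steps, words)
        b, n, w = steps.shape
        step_vecs = self.step_enc(steps.view(b * n, w)).view(b, n, -1)
        ctx, _ = self.recipe_rnn(step_vecs)     # running context after each step
        pred_next = self.predict(ctx[:, :-1])   # predictions for steps 2..n
        target = step_vecs[:, 1:].detach()
        return F.mse_loss(pred_next, target)

# Toy usage: 2 recipes, 5 steps of 8 word ids each.
model = RecipeModel(vocab_size=1000)
loss = model(torch.randint(0, 1000, (2, 5, 8)))
loss.backward()
```

To make such a model usable for video in the spirit of the abstract, one would additionally train a video encoder to map observed segments into the same step-embedding space, so a partially watched video can drive the recipe-level predictor, plus a text decoder to turn predicted embeddings back into natural-language steps.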
Related papers
- VILP: Imitation Learning with Latent Video Planning [19.25411361966752]
This paper introduces imitation learning with latent video planning (VILP).
Our method generates highly time-aligned videos from multiple views.
Our paper provides a practical example of how to effectively integrate video generation models into robot policies.
arXiv Detail & Related papers (2025-02-03T19:55:57Z)
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input.
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z)
- An Empirical Study of Autoregressive Pre-training from Videos [67.15356613065542]
We treat videos as visual tokens and train transformer models to autoregressively predict future tokens.
Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens.
Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance.
arXiv Detail & Related papers (2025-01-09T18:59:58Z)
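As a point of reference for the entry above, the sketch below shows the generic form of that objective: discretize a video into a sequence of visual token ids (with a tokenizer assumed to exist separately) and train a causal Transformer to predict each next token. The vocabulary size, depth, and class name VideoGPT are illustrative choices, not the paper's configuration.

```python
# Hypothetical sketch of autoregressive pre-training over visual tokens.
# Frames are assumed to be pre-tokenized by a separate visual tokenizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, CTX = 8192, 512, 1024   # illustrative sizes

class VideoGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(CTX, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                   # (batch, seq) of token ids
        seq = tokens.shape[1]
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok(tokens) + self.pos(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = torch.triu(torch.full((seq, seq), float("-inf"),
                                     device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.head(x)                      # next-token logits

model = VideoGPT()
tokens = torch.randint(0, VOCAB, (2, 128))       # toy "video" as token ids
logits = model(tokens)
# Shift by one so position t predicts token t + 1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                       tokens[:, 1:].reshape(-1))
```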
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a method for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
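The pairing objective in the last entry ("train a model to pair each video with its associated text") is commonly realized as a symmetric contrastive loss over video and text embeddings. The sketch below shows that generic form using stand-in linear projections over precomputed features; the dimensions and loss details are assumptions, not necessarily the paper's exact recipe.

```python
# Hypothetical sketch of video-text pairing via a symmetric contrastive loss.
# Encoders are stand-in linear projections over precomputed clip/text features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairingModel(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))  # learnable temperature

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T   # (batch, batch) similarities
        labels = torch.arange(v.shape[0], device=v.device)
        # Each clip should match its own text, and vice versa.
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

# Toy usage with random precomputed features for a batch of 8 pairs.
model = PairingModel()
loss = model(torch.randn(8, 1024), torch.randn(8, 768))
loss.backward()
```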
This list is automatically generated from the titles and abstracts of the papers on this site.