Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
Experts
- URL: http://arxiv.org/abs/2304.10505v1
- Date: Fri, 24 Mar 2023 17:18:40 GMT
- Title: Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
Experts
- Authors: Kastan Day, Daniel Christl, Rohan Salvi, Pranav Sriram
- Abstract summary: We present the Video Pre-trained Transformer (VPT).
It uses four SOTA encoder models to convert a video into a sequence of compact embeddings.
It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos.
- Score: 2.457872341625575
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the Video Pre-trained Transformer (VPT). VPT uses four SOTA encoder models
from prior work to convert a video into a sequence of compact embeddings. Our
backbone, based on a reference Flan-T5-11B architecture, learns a universal
representation of the video that is a non-linear sum of the encoder models. It
learns using an autoregressive causal language modeling loss by predicting the
words spoken in YouTube videos. Finally, we evaluate on standard downstream
benchmarks by training fully connected prediction heads for each task. To the
best of our knowledge, this is the first use of multiple frozen SOTA models as
encoders in an "embedding -> backbone -> prediction head" design pattern; all
prior approaches have trained their own joint encoder models. Additionally, we include
more modalities than the current SOTA, Merlot Reserve, by adding explicit Scene
Graph information. For these two reasons, we believe it could combine the
world's best open-source models to achieve SOTA performance. Initial
experiments demonstrate that the model is learning appropriately, but more
experimentation and compute, already in progress, are necessary to realize
our loftier goals. Alongside this work, we build on the YT-20M dataset,
reproducing it and adding 25,000 personally selected YouTube videos to its
corpus. All code and model checkpoints are open sourced under a standard MIT
license.
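Below is a minimal PyTorch sketch of the "embedding -> backbone -> prediction head" pattern described in the abstract. The input dimensions, the four stand-in encoders, and the tiny Transformer used in place of the Flan-T5-11B (encoder-decoder) backbone are illustrative assumptions, not the authors' implementation; the sketch only shows the data flow of frozen encoders feeding a trainable backbone supervised with a causal language-modeling loss.

```python
# Illustrative data flow only: frozen encoders -> trainable backbone -> LM head,
# plus a per-task prediction head. Sizes and the tiny Transformer are stand-ins.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for one frozen SOTA encoder (vision, audio, scene graph, ...)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False              # encoders stay frozen

    def forward(self, x):                        # x: (batch, seq, in_dim)
        return self.proj(x)                      # compact per-modality embeddings

class VPTSketch(nn.Module):
    def __init__(self, in_dims=(768, 512, 1024, 256), d_model=512, vocab_size=32128):
        super().__init__()
        self.encoders = nn.ModuleList(FrozenEncoder(d, d_model) for d in in_dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # toy stand-in for Flan-T5-11B
        self.lm_head = nn.Linear(d_model, vocab_size)               # predicts the spoken words

    def forward(self, modality_inputs):
        # Each modality becomes a sequence of compact embeddings; the backbone
        # mixes the concatenated sequences (the abstract's "non-linear sum").
        tokens = torch.cat([enc(x) for enc, x in zip(self.encoders, modality_inputs)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        hidden = self.backbone(tokens, mask=causal)
        return self.lm_head(hidden)              # logits for the causal LM loss

class TaskHead(nn.Module):
    """Fully connected prediction head trained per downstream benchmark."""
    def __init__(self, d_model=512, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, hidden):                   # hidden: (batch, seq, d_model)
        return self.fc(hidden.mean(dim=1))       # pool over time, then classify

feats = [torch.rand(2, 16, d) for d in (768, 512, 1024, 256)]  # toy per-modality features
logits = VPTSketch()(feats)
print(logits.shape)                              # (2, 64, 32128): one prediction per embedding
```

Downstream benchmarks are then handled by small task-specific heads such as `TaskHead` above, matching the abstract's "fully connected prediction heads for each task".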
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Video Prediction Models as Rewards for Reinforcement Learning [127.53893027811027]
VIPER is an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning.
We see our work as a starting point for scalable reward specification from unlabeled videos; a rough sketch of the reward idea follows this entry.
arXiv Detail & Related papers (2023-05-23T17:59:33Z)
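A hedged sketch of the idea summarized above: a frozen, pretrained video prediction model scores how likely each observed frame is given the frames before it, and that log-likelihood is used as an action-free reward. The `frame_log_prob` callable and the toy stand-in below are hypothetical placeholders, not VIPER's actual interface.

```python
# Use a frozen video prediction model as an action-free reward signal.
# `frame_log_prob` is a hypothetical interface: it should return
# log p(frame_t | frames_{<t}) under the pretrained model.
from typing import Callable, List

import torch

def video_model_rewards(frames: List[torch.Tensor],
                        frame_log_prob: Callable[[List[torch.Tensor], torch.Tensor], float]
                        ) -> List[float]:
    """Reward for step t is the model's log-likelihood of the observed frame t."""
    rewards = []
    for t in range(1, len(frames)):
        context, nxt = frames[:t], frames[t]
        rewards.append(frame_log_prob(context, nxt))
    return rewards

# Toy stand-in "model": assigns higher reward to temporally smooth trajectories.
def toy_log_prob(context, nxt):
    return float(-((nxt - context[-1]) ** 2).mean())

episode = [torch.rand(3, 64, 64) for _ in range(5)]    # five RGB frames
print(video_model_rewards(episode, toy_log_prob))      # four per-step rewards
```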
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Seer: Language Instructed Video Prediction with Latent Diffusion Models [43.708550061909754]
Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating pretrained text-to-image (T2I) Stable Diffusion models along the temporal axis.
With its adaptable architecture design, Seer can generate high-fidelity, coherent, and instruction-aligned video frames; a brief sketch of the temporal-inflation recipe follows this entry.
arXiv Detail & Related papers (2023-03-27T03:12:24Z)
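The "inflation" mentioned above has a common recipe in the video-diffusion literature: copy pretrained 2D (spatial) weights into a 3D operator whose temporal kernel is concentrated on the centre time step, so the inflated layer initially behaves exactly like the original per-frame layer. The sketch below shows that generic recipe; it is an assumption for illustration, not necessarily Seer's exact scheme.

```python
# Inflate a pretrained 2D convolution into a 3D (temporal) convolution.
# Putting all of the weight on the middle time step keeps the inflated layer's
# per-frame output identical to the original 2D layer at initialization.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_kernel, *conv2d.kernel_size),
                       padding=(time_kernel // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, time_kernel // 2] = conv2d.weight  # copy spatial weights
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

spatial = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # e.g. one U-Net block conv
temporal = inflate_conv2d(spatial)
video = torch.rand(1, 4, 8, 32, 32)                   # (batch, C, T, H, W)
print(temporal(video).shape)                          # torch.Size([1, 4, 8, 32, 32])
```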
- VMFormer: End-to-End Video Matting with Transformer [48.97730965527976]
Video matting aims to predict alpha mattes for each frame from a given input video sequence.
Recent solutions to video matting have been dominated by deep convolutional neural networks (CNNs).
We propose VMFormer: a transformer-based end-to-end method for video matting.
arXiv Detail & Related papers (2022-08-26T17:51:02Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models; a small sketch of the attention-mask difference follows this entry.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
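The causal vs. non-causal distinction above comes down to the attention mask: a causal decoder attends only to earlier positions, while a non-causal (prefix-LM) decoder attends bidirectionally over an input prefix and causally afterwards, so adaptation can be viewed as relaxing the mask over the prefix. A small illustrative sketch, not tied to the paper's specific models:

```python
# Attention masks for a causal decoder vs. a non-causal (prefix-LM) decoder.
# True means "query position (row) may attend to key position (column)".
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True   # every position sees the whole prefix bidirectionally
    return mask

print(causal_mask(5).int())                    # lower-triangular
print(prefix_lm_mask(5, prefix_len=2).int())   # first two columns fully visible
```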
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.