Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
Experts
- URL: http://arxiv.org/abs/2304.10505v1
- Date: Fri, 24 Mar 2023 17:18:40 GMT
- Title: Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
Experts
- Authors: Kastan Day, Daniel Christl, Rohan Salvi, Pranav Sriram
- Abstract summary: We present the Video Pre-trained Transformer (VPT).
It uses four SOTA encoder models to convert a video into a sequence of compact embeddings.
It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos.
- Score: 2.457872341625575
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the Video Pre-trained Transformer (VPT). VPT uses four SOTA encoder models
from prior work to convert a video into a sequence of compact embeddings. Our
backbone, based on a reference Flan-T5-11B architecture, learns a universal
representation of the video that is a non-linear sum of the encoder models. It
learns using an autoregressive causal language modeling loss by predicting the
words spoken in YouTube videos. Finally, we evaluate on standard downstream
benchmarks by training fully connected prediction heads for each task. To the
best of our knowledge, this is the first use of multiple frozen SOTA models as
encoders in an "embedding -> backbone -> prediction head" design pattern; all
prior approaches have trained their own joint encoder models. Additionally, we include
more modalities than the current SOTA, Merlot Reserve, by adding explicit Scene
Graph information. For these two reasons, we believe it could combine the
world's best open-source models to achieve SOTA performance. Initial
experiments demonstrate that the model is learning appropriately, but more
experimentation and compute, already in progress, are necessary to realize
our loftier goals. Alongside this work, we build on the YT-20M dataset,
reproducing it and adding 25,000 personally selected YouTube videos to its
corpus. All code and model checkpoints are open sourced under a standard MIT
license.
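Below is a minimal PyTorch sketch of the "embedding -> backbone -> prediction head" pattern described in the abstract. The input dimensions, the four stand-in encoders, and the tiny Transformer used in place of the Flan-T5-11B (encoder-decoder) backbone are illustrative assumptions, not the authors' implementation; the sketch only shows the data flow of frozen encoders feeding a trainable backbone supervised with a causal language-modeling loss.

```python
# Illustrative data flow only: frozen encoders -> trainable backbone -> LM head,
# plus a per-task prediction head. Sizes and the tiny Transformer are stand-ins.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for one frozen SOTA encoder (vision, audio, scene graph, ...)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False              # encoders stay frozen

    def forward(self, x):                        # x: (batch, seq, in_dim)
        return self.proj(x)                      # compact per-modality embeddings

class VPTSketch(nn.Module):
    def __init__(self, in_dims=(768, 512, 1024, 256), d_model=512, vocab_size=32128):
        super().__init__()
        self.encoders = nn.ModuleList(FrozenEncoder(d, d_model) for d in in_dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # toy stand-in for Flan-T5-11B
        self.lm_head = nn.Linear(d_model, vocab_size)               # predicts the spoken words

    def forward(self, modality_inputs):
        # Each modality becomes a sequence of compact embeddings; the backbone
        # mixes the concatenated sequences (the abstract's "non-linear sum").
        tokens = torch.cat([enc(x) for enc, x in zip(self.encoders, modality_inputs)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        hidden = self.backbone(tokens, mask=causal)
        return self.lm_head(hidden)              # logits for the causal LM loss

class TaskHead(nn.Module):
    """Fully connected prediction head trained per downstream benchmark."""
    def __init__(self, d_model=512, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, hidden):                   # hidden: (batch, seq, d_model)
        return self.fc(hidden.mean(dim=1))       # pool over time, then classify

feats = [torch.rand(2, 16, d) for d in (768, 512, 1024, 256)]  # toy per-modality features
logits = VPTSketch()(feats)
print(logits.shape)                              # (2, 64, 32128): one prediction per embedding
```

Downstream benchmarks are then handled by small task-specific heads such as `TaskHead` above, matching the abstract's "fully connected prediction heads for each task".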
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Video Prediction Models as Rewards for Reinforcement Learning [127.53893027811027]
VIPER is an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning.
We see our work as a starting point for scalable reward specification from unlabeled videos; a rough sketch of the reward idea follows this entry.
arXiv Detail & Related papers (2023-05-23T17:59:33Z)
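A hedged sketch of the idea summarized above: a frozen, pretrained video prediction model scores how likely each observed frame is given the frames before it, and that log-likelihood is used as an action-free reward. The `frame_log_prob` callable and the toy stand-in below are hypothetical placeholders, not VIPER's actual interface.

```python
# Use a frozen video prediction model as an action-free reward signal.
# `frame_log_prob` is a hypothetical interface: it should return
# log p(frame_t | frames_{<t}) under the pretrained model.
from typing import Callable, List

import torch

def video_model_rewards(frames: List[torch.Tensor],
                        frame_log_prob: Callable[[List[torch.Tensor], torch.Tensor], float]
                        ) -> List[float]:
    """Reward for step t is the model's log-likelihood of the observed frame t."""
    rewards = []
    for t in range(1, len(frames)):
        context, nxt = frames[:t], frames[t]
        rewards.append(frame_log_prob(context, nxt))
    return rewards

# Toy stand-in "model": assigns higher reward to temporally smooth trajectories.
def toy_log_prob(context, nxt):
    return float(-((nxt - context[-1]) ** 2).mean())

episode = [torch.rand(3, 64, 64) for _ in range(5)]    # five RGB frames
print(video_model_rewards(episode, toy_log_prob))      # four per-step rewards
```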
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Seer: Language Instructed Video Prediction with Latent Diffusion Models [43.708550061909754]
Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating pretrained text-to-image (T2I) Stable Diffusion models along the temporal axis.
With its adaptable architecture design, Seer can generate high-fidelity, coherent, and instruction-aligned video frames; a brief sketch of the temporal-inflation recipe follows this entry.
arXiv Detail & Related papers (2023-03-27T03:12:24Z)
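The "inflation" mentioned above has a common recipe in the video-diffusion literature: copy pretrained 2D (spatial) weights into a 3D operator whose temporal kernel is concentrated on the centre time step, so the inflated layer initially behaves exactly like the original per-frame layer. The sketch below shows that generic recipe; it is an assumption for illustration, not necessarily Seer's exact scheme.

```python
# Inflate a pretrained 2D convolution into a 3D (temporal) convolution.
# Putting all of the weight on the middle time step keeps the inflated layer's
# per-frame output identical to the original 2D layer at initialization.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_kernel, *conv2d.kernel_size),
                       padding=(time_kernel // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, time_kernel // 2] = conv2d.weight  # copy spatial weights
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

spatial = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # e.g. one U-Net block conv
temporal = inflate_conv2d(spatial)
video = torch.rand(1, 4, 8, 32, 32)                   # (batch, C, T, H, W)
print(temporal(video).shape)                          # torch.Size([1, 4, 8, 32, 32])
```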
- VMFormer: End-to-End Video Matting with Transformer [48.97730965527976]
Video matting aims to predict alpha mattes for each frame from a given input video sequence.
Recent solutions to video matting have been dominated by deep convolutional neural networks (CNNs).
We propose VMFormer: a transformer-based end-to-end method for video matting.
arXiv Detail & Related papers (2022-08-26T17:51:02Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models; a small sketch of the attention-mask difference follows this entry.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
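The causal vs. non-causal distinction above comes down to the attention mask: a causal decoder attends only to earlier positions, while a non-causal (prefix-LM) decoder attends bidirectionally over an input prefix and causally afterwards, so adaptation can be viewed as relaxing the mask over the prefix. A small illustrative sketch, not tied to the paper's specific models:

```python
# Attention masks for a causal decoder vs. a non-causal (prefix-LM) decoder.
# True means "query position (row) may attend to key position (column)".
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True   # every position sees the whole prefix bidirectionally
    return mask

print(causal_mask(5).int())                    # lower-triangular
print(prefix_lm_mask(5, prefix_len=2).int())   # first two columns fully visible
```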
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.