SimVTP: Simple Video Text Pre-training with Masked Autoencoders
- URL: http://arxiv.org/abs/2212.03490v1
- Date: Wed, 7 Dec 2022 07:14:22 GMT
- Title: SimVTP: Simple Video Text Pre-training with Masked Autoencoders
- Authors: Yue Ma, Tianyu Yang, Yin Shan, Xiu Li
- Abstract summary: This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders.
We randomly mask out the spatial-temporal tubes of input video and the word tokens of input text.
- Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with help from the other modality.
- Score: 22.274024313475646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents SimVTP: a Simple Video-Text Pretraining framework via
masked autoencoders. We randomly mask out the spatial-temporal tubes of input
video and the word tokens of input text and then feed them into a unified
autoencoder to reconstruct the missing pixels and words. Our SimVTP has several
properties: 1) Thanks to the unified autoencoder, SimVTP reconstructs the
masked signal of one modality with help from the other modality, which
implicitly learns the cross-modal alignment between video tubes and text
tokens. 2) SimVTP not only benefits from a high video masking ratio (e.g. 90%)
due to the temporal redundancy of video, but also needs a high text masking
ratio (e.g. 75%), which is much higher than BERT (e.g. 15%), to achieve optimal
performance. This is because the aid of the video modality makes text
reconstruction less challenging, so a higher mask ratio is needed to make the
pretext task hard enough for useful feature learning. 3) Equipping SimVTP with
video-text contrastive learning (VTC) and video-text matching (VTM), which are
two commonly used cross-modal training strategies, could further improve the
transferable performance significantly. 4) SimVTP is data-efficient, e.g.,
pre-training on only 10% of the WebVid-2M data, SimVTP achieves surprisingly good
results (43.8 R@1) on MSRVTT, which is far above recent state-of-the-art
methods pre-trained on both CC3M and WebVid-2M. We transfer our pre-trained
model to various downstream tasks and achieve superior performance. The codes
and models will be released at https://github.com/mayuelala/SimVTP.
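For illustration, the masking step described in the abstract can be sketched roughly as follows in PyTorch. This is a minimal sketch assuming tube embeddings and tokenized text are already available; the function names, default ratios, and tensor layouts are illustrative assumptions and are not taken from the SimVTP codebase.

```python
# Rough sketch of the video-tube and text-token masking described in the abstract.
# All names and defaults here are illustrative assumptions, not SimVTP's actual code.
import torch


def mask_video_tubes(video_tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Randomly drop spatial-temporal tube tokens (high ratio, e.g. 90%).

    video_tokens: (B, N, D) tube embeddings after patchifying the clip.
    Returns the kept tokens and a boolean mask marking the dropped positions.
    """
    B, N, D = video_tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=video_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # random subset per sample
    kept = torch.gather(
        video_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    mask = torch.ones(B, N, dtype=torch.bool, device=video_tokens.device)
    mask.scatter_(1, keep_idx, False)  # True = masked, to be reconstructed
    return kept, mask


def mask_text_tokens(token_ids: torch.Tensor, mask_token_id: int,
                     mask_ratio: float = 0.75):
    """Replace a random fraction of word tokens with [MASK], BERT-style but
    with a much heavier ratio (e.g. 75%). In practice special and padding
    tokens would be excluded; omitted here for brevity."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_ratio
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    return corrupted, mask


# The visible video tubes and the corrupted text would then be fed into a single
# (unified) encoder-decoder that reconstructs the missing pixels and words.
```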
Related papers
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding [12.215829700340988]
Video-XL-Pro is an efficient method for extremely long video understanding.
Video-XL-Pro can process over 8K frames on a single A100 GPU.
arXiv Detail & Related papers (2025-03-24T09:21:48Z)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks.
It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: reduce FLOPs (60% off), accelerate pre-training (by 3x), and improve performance.
arXiv Detail & Related papers (2022-12-02T05:44:23Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V)
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose the best prior-knowledge ones.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)