SimVTP: Simple Video Text Pre-training with Masked Autoencoders
- URL: http://arxiv.org/abs/2212.03490v1
- Date: Wed, 7 Dec 2022 07:14:22 GMT
- Title: SimVTP: Simple Video Text Pre-training with Masked Autoencoders
- Authors: Yue Ma, Tianyu Yang, Yin Shan, Xiu Li
- Abstract summary: This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders.
We randomly mask out the spatial-temporal tubes of input video and the word tokens of input text.
- Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with help from the other modality.
- Score: 22.274024313475646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents SimVTP: a Simple Video-Text Pretraining framework via
masked autoencoders. We randomly mask out the spatial-temporal tubes of input
video and the word tokens of input text and then feed them into a unified
autoencoder to reconstruct the missing pixels and words. Our SimVTP has several
properties: 1) Thanks to the unified autoencoder, SimVTP reconstructs the
masked signal of one modality with help from the other modality, which
implicitly learns the cross-modal alignment between video tubes and text
tokens. 2) SimVTP not only benefits from a high video masking ratio (e.g. 90%)
due to the temporal redundancy of video, but also needs a high text masking
ratio (e.g. 75%), which is much higher than BERT (e.g. 15%), to achieve optimal
performance. This is because the aid of the video modality makes text
reconstruction less challenging, so a higher mask ratio is needed to make the
pretext task hard enough for useful feature learning. 3) Equipping SimVTP with
video-text contrastive learning (VTC) and video-text matching (VTM), which are
two commonly used cross-modal training strategies, could further improve the
transferable performance significantly. 4) SimVTP is data-efficient, e.g.,
pre-training on only 10% of the WebVid-2M data, SimVTP achieves surprisingly good
results (43.8 R@1) on MSRVTT, which is far above recent state-of-the-art
methods pre-trained on both CC3M and WebVid-2M. We transfer our pre-trained
model to various downstream tasks and achieve superior performance. The codes
and models will be released at https://github.com/mayuelala/SimVTP.
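For illustration, the masking step described in the abstract can be sketched roughly as follows in PyTorch. This is a minimal sketch assuming tube embeddings and tokenized text are already available; the function names, default ratios, and tensor layouts are illustrative assumptions and are not taken from the SimVTP codebase.

```python
# Rough sketch of the video-tube and text-token masking described in the abstract.
# All names and defaults here are illustrative assumptions, not SimVTP's actual code.
import torch


def mask_video_tubes(video_tokens: torch.Tensor, mask_ratio: float = 0.9):
    """Randomly drop spatial-temporal tube tokens (high ratio, e.g. 90%).

    video_tokens: (B, N, D) tube embeddings after patchifying the clip.
    Returns the kept tokens and a boolean mask marking the dropped positions.
    """
    B, N, D = video_tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=video_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # random subset per sample
    kept = torch.gather(
        video_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    mask = torch.ones(B, N, dtype=torch.bool, device=video_tokens.device)
    mask.scatter_(1, keep_idx, False)  # True = masked, to be reconstructed
    return kept, mask


def mask_text_tokens(token_ids: torch.Tensor, mask_token_id: int,
                     mask_ratio: float = 0.75):
    """Replace a random fraction of word tokens with [MASK], BERT-style but
    with a much heavier ratio (e.g. 75%). In practice special and padding
    tokens would be excluded; omitted here for brevity."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_ratio
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    return corrupted, mask


# The visible video tubes and the corrupted text would then be fed into a single
# (unified) encoder-decoder that reconstructs the missing pixels and words.
```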
Related papers
- Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding [12.215829700340988]
Video-XL-Pro is an efficient method for extremely long video understanding.
Video-XL-Pro can process over 8K frames on a single A100 GPU.
arXiv Detail & Related papers (2025-03-24T09:21:48Z)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks.
It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: reduce FLOPs (60% off), accelerate pre-training (by 3x), and improve performance.
arXiv Detail & Related papers (2022-12-02T05:44:23Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V)
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose the best prior-knowledge ones.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)