Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
- URL: http://arxiv.org/abs/2212.00986v2
- Date: Mon, 5 Dec 2022 04:08:23 GMT
- Title: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
- Authors: Fangxun Shu, Biaolong Chen, Yue Liao, Shuwen Xiao, Wenyu Sun, Xiaobo
Li, Yousong Zhu, Jinqiao Wang and Si Liu
- Abstract summary: We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: it reduces FLOPs by 60%, accelerates pre-training by 3x, and improves performance.
- Score: 37.05164804180039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a simple yet effective end-to-end Video-language Pre-training
(VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for
video-text retrieval tasks. Our MAC aims to reduce the spatial and temporal
redundancy of video representations in the VidLP model via a mask sampling
mechanism, improving pre-training efficiency. In contrast to conventional
temporal sparse sampling, we propose to randomly mask a high ratio of spatial
regions and feed only the visible regions into the encoder as sparse spatial
sampling. Similarly, we
adopt the mask sampling technique for text inputs for consistency. Instead of
blindly applying the mask-then-prediction paradigm from MAE, we propose a
mask-then-alignment paradigm for efficient video-text alignment. The
motivation is that video-text retrieval tasks rely on high-level alignment
rather than low-level reconstruction, and multimodal alignment with masked
modeling encourages the model to learn a robust and general multimodal
representation from incomplete and unstable inputs. Coupling these designs
enables efficient end-to-end pre-training: it reduces FLOPs by 60%, accelerates
pre-training by 3x, and improves performance. Our MAC achieves
state-of-the-art results on various video-text retrieval datasets, including
MSR-VTT, DiDeMo, and ActivityNet. Our approach is flexible with respect to
input modalities: with minimal modifications, we achieve competitive results
on image-text retrieval tasks.
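The core recipe described in the abstract (drop most spatial patch tokens, lightly mask text tokens, encode only the surviving tokens, and align the two modalities contrastively instead of reconstructing them) can be illustrated with a short PyTorch sketch. The code below is illustrative only: the mask ratios, the tiny 2-layer transformer encoders, mean pooling, and the symmetric InfoNCE loss are assumptions made for readability, not the paper's actual architecture or hyperparameters.

```python
# Minimal sketch of mask sampling + masked contrastive alignment (assumed design).
import torch
import torch.nn.functional as F
from torch import nn


def random_keep(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of tokens per sample and drop the rest entirely.

    tokens: (B, N, D) -> (B, n_keep, D) with n_keep = max(1, int(N * keep_ratio)).
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    noise = torch.rand(B, N, device=tokens.device)            # random score per token
    keep_idx = noise.argsort(dim=1)[:, :n_keep]               # indices of surviving tokens
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))


class MaskedContrastiveSketch(nn.Module):
    def __init__(self, dim: int = 256, vocab: int = 30522):
        super().__init__()
        self.patch_proj = nn.Linear(3 * 16 * 16, dim)          # flattened 16x16 RGB patches
        self.text_embed = nn.Embedding(vocab, dim)
        self.video_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, patches, token_ids, video_keep=0.25, text_keep=0.85):
        # Sparse spatial sampling: the video encoder only sees ~25% of the patch tokens.
        v = random_keep(self.patch_proj(patches), video_keep)
        # Lighter mask sampling on the text side, for consistency with the video branch.
        t = random_keep(self.text_embed(token_ids), text_keep)
        v = F.normalize(self.video_enc(v).mean(dim=1), dim=-1)  # pooled video embedding
        t = F.normalize(self.text_enc(t).mean(dim=1), dim=-1)   # pooled text embedding
        # Mask-then-alignment: symmetric InfoNCE over the batch, no reconstruction head.
        logits = v @ t.T / self.temperature.clamp(min=1e-3)
        labels = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))


if __name__ == "__main__":
    model = MaskedContrastiveSketch()
    patches = torch.randn(4, 8 * 196, 3 * 16 * 16)  # 4 clips: 8 frames x 196 patches each
    tokens = torch.randint(0, 30522, (4, 32))       # 4 captions of 32 token ids
    print("contrastive loss:", model(patches, tokens).item())
```

The efficiency argument is visible in the sketch: the video encoder never processes the masked patch tokens at all, which is where the claimed FLOP reduction and pre-training speedup come from.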
Related papers
- Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders [6.498925999634298]
This paper presents a novel approach for communication-efficient distributed multiview detection and tracking using masked autoencoders (MAEs).
We introduce a semantic-guided masking strategy that leverages pre-trained segmentation models and a tunable power function to prioritize informative image regions.
We evaluate our method on both virtual and real-world multiview datasets, demonstrating comparable detection and tracking performance.
arXiv Detail & Related papers (2024-10-07T08:06:41Z) - Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
arXiv Detail & Related papers (2024-08-01T17:58:19Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames, making full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former, with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - Mask to reconstruct: Cooperative Semantics Completion for Video-text
Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), based on semantic-based masked modeling.
Our MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z) - RaP: Redundancy-aware Video-language Pre-training for Text-Video
Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity (see the sketch after this list).
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
arXiv Detail & Related papers (2022-10-13T10:11:41Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
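As referenced in the RaP entry above, a cross-modal minimum dissimilarity can be computed directly from patch and token embeddings. The toy sketch below is not taken from the RaP code: it simply scores each video patch by its smallest cosine dissimilarity to any text token, and vice versa; how such redundancy scores would be consumed downstream (e.g., dropping or re-weighting tokens) is an assumption.

```python
import torch
import torch.nn.functional as F


def min_cross_modal_dissimilarity(patch_emb: torch.Tensor, token_emb: torch.Tensor):
    """patch_emb: (P, D), token_emb: (T, D); returns per-patch and per-token scores."""
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(token_emb, dim=-1)
    dissim = 1.0 - p @ t.T                  # (P, T) pairwise cosine dissimilarities
    patch_score = dissim.min(dim=1).values  # per-patch: min over all text tokens
    token_score = dissim.min(dim=0).values  # per-token: min over all video patches
    return patch_score, token_score


# Toy usage with random embeddings (196 patches, 32 tokens, 256-d features).
patch_score, token_score = min_cross_modal_dissimilarity(
    torch.randn(196, 256), torch.randn(32, 256)
)
print(patch_score.shape, token_score.shape)  # torch.Size([196]) torch.Size([32])
```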
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.