Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
- URL: http://arxiv.org/abs/2212.00986v2
- Date: Mon, 5 Dec 2022 04:08:23 GMT
- Title: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
- Authors: Fangxun Shu, Biaolong Chen, Yue Liao, Shuwen Xiao, Wenyu Sun, Xiaobo
Li, Yousong Zhu, Jinqiao Wang and Si Liu
- Abstract summary: We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC).
Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model.
Coupling these designs enables efficient end-to-end pre-training: FLOPs are reduced by 60%, pre-training is accelerated by 3x, and performance improves.
- Score: 37.05164804180039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a simple yet effective end-to-end Video-language Pre-training
(VidLP) framework, Masked Contrastive Video-language Pretraining (MAC), for
video-text retrieval tasks. Our MAC aims to reduce the spatial and temporal
redundancy of video representations in the VidLP model via a mask sampling
mechanism to improve pre-training efficiency. In contrast to conventional
temporal sparse sampling, we propose to randomly mask a high ratio of spatial
regions and only feed the visible regions into the encoder as sparse spatial
sampling. Similarly, we
adopt the mask sampling technique for text inputs for consistency. Instead of
blindly applying the mask-then-prediction paradigm from MAE, we propose a
masked-then-alignment paradigm for efficient video-text alignment. The
motivation is that video-text retrieval tasks rely on high-level alignment
rather than low-level reconstruction, and multimodal alignment with masked
modeling encourages the model to learn a robust and general multimodal
representation from incomplete and unstable inputs. Coupling these designs
enables efficient end-to-end pre-training: FLOPs are reduced by 60%,
pre-training is accelerated by 3x, and performance improves. Our MAC achieves
state-of-the-art results on various video-text retrieval datasets, including
MSR-VTT, DiDeMo, and ActivityNet. Our approach is omnivorous to input
modalities. With minimal modifications, we achieve competitive results on
image-text retrieval tasks.
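The sketch below illustrates the two ideas the abstract describes: sparse spatial sampling (encode only a random subset of patch tokens) and masked-then-alignment (a symmetric contrastive loss on pooled embeddings computed from masked inputs). It is a minimal PyTorch-style sketch under assumed shapes, mask ratio, and temperature, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def sparse_spatial_sample(patches, mask_ratio=0.75):
    """Randomly keep (1 - mask_ratio) of the patch tokens per sample.

    patches: (B, N, D) patch embeddings; only the visible tokens are
    returned, so the encoder never sees the masked regions.
    """
    B, N, D = patches.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=patches.device)
    keep = noise.argsort(dim=1)[:, :n_keep]                 # (B, n_keep)
    return torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))

def masked_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled video and text embeddings,
    both computed from masked (incomplete) inputs."""
    v = F.normalize(video_emb, dim=-1)                      # (B, D)
    t = F.normalize(text_emb, dim=-1)                       # (B, D)
    logits = v @ t.t() / temperature                        # (B, B)
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

With an assumed mask ratio of 0.75, roughly three quarters of the patch tokens never enter the encoder, which is where the kind of FLOP and wall-clock savings reported in the abstract come from.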
Related papers
- MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation [109.19165503929992]
Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models.
We present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks.
We achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets.
arXiv Detail & Related papers (2024-12-16T05:44:45Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
The computational cost of processing high-resolution images and videos poses a barrier to the broader adoption of multi-modal large language models (MLLMs).
Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
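As a generic illustration of vision-token grouping (VisToG's actual grouping is learned on top of a pre-trained vision encoder; the seed-and-assign scheme below is only a stand-in), the sketch merges similar patch tokens so far fewer tokens are handed to the language model.

```python
import torch
import torch.nn.functional as F

def group_vision_tokens(tokens, num_groups=64):
    """Compress N patch tokens into num_groups tokens by assigning each
    token to its most similar seed and averaging within each group.

    tokens: (N, D) patch embeddings from a pre-trained vision encoder.
    """
    N, D = tokens.shape
    seeds = tokens[torch.linspace(0, N - 1, num_groups).long()]          # (G, D)
    sim = F.normalize(tokens, dim=-1) @ F.normalize(seeds, dim=-1).t()   # (N, G)
    assign = sim.argmax(dim=1)                                           # (N,)
    grouped = torch.zeros(num_groups, D, device=tokens.device)
    counts = torch.zeros(num_groups, 1, device=tokens.device)
    grouped.index_add_(0, assign, tokens)
    counts.index_add_(0, assign, torch.ones(N, 1, device=tokens.device))
    return grouped / counts.clamp(min=1.0)                               # (G, D)
```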
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
- Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders [6.498925999634298]
This paper presents a novel approach for communication-efficient distributed multiview detection and tracking using masked autoencoders (MAEs).
We introduce a semantic-guided masking strategy that leverages pre-trained segmentation models and a tunable power function to prioritize informative image regions.
We evaluate our method on both virtual and real-world multiview datasets, demonstrating comparable detection and tracking performance.
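A minimal sketch of the semantic-guided masking idea described above, assuming each image patch already has a semantic score in [0, 1] from a pre-trained segmentation model; the exponent `gamma` is a hypothetical stand-in for the paper's tunable power function, and `keep_budget` caps the average fraction of patches transmitted.

```python
import numpy as np

def semantic_keep_probability(scores, gamma=2.0, keep_budget=0.25):
    """Map per-patch semantic scores in [0, 1] to keep probabilities.

    Raising the scores to the power gamma sharpens the preference for
    informative regions; rescaling makes the expected fraction of kept
    patches roughly equal to keep_budget.
    """
    weights = np.power(np.clip(scores, 0.0, 1.0), gamma)
    weights = weights / (weights.mean() + 1e-8) * keep_budget
    return np.clip(weights, 0.0, 1.0)

# Background patches (low scores) are rarely kept; object patches often are.
print(semantic_keep_probability(np.array([0.05, 0.2, 0.9, 0.95])))
```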
arXiv Detail & Related papers (2024-10-07T08:06:41Z)
- Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with the highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
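A hedged sketch of the text-guided masking idea: score each video patch by its best match to the caption tokens and mask the highest-scoring patches. The embedding names and the mask ratio are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def text_guided_mask(patch_emb, text_emb, mask_ratio=0.75):
    """Return a boolean mask (True = masked) hiding the patches with the
    highest correspondence to the paired caption.

    patch_emb: (N, D) video patch embeddings in a shared space
    text_emb:  (T, D) caption token embeddings in the same space
    """
    sim = F.normalize(patch_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    relevance = sim.max(dim=1).values                 # best caption match per patch
    n_mask = int(patch_emb.size(0) * mask_ratio)
    mask = torch.zeros(patch_emb.size(0), dtype=torch.bool, device=patch_emb.device)
    mask[relevance.topk(n_mask).indices] = True
    return mask
```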
arXiv Detail & Related papers (2024-08-01T17:58:19Z)
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches to text-based video editing predominantly rely on diffusion models.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z)
- Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with only up to a 2% mIoU degradation on the Cityscapes validation set.
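The general keyframe-plus-propagation pattern behind such frameworks can be sketched as follows; this is only a schematic, where `heavy_segmenter` and `propagate` are placeholders for the per-keyframe segmenter and the framework's lightweight propagation module, not the paper's actual components.

```python
def segment_video(frames, heavy_segmenter, propagate, key_interval=5):
    """Run the expensive segmenter only on key frames and propagate its
    masks to the frames in between.

    heavy_segmenter(frame) -> mask                   # full-cost per-frame model
    propagate(key_mask, key_frame, frame) -> mask    # cheap propagation module
    """
    masks, key_mask, key_frame = [], None, None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            key_mask, key_frame = heavy_segmenter(frame), frame
            masks.append(key_mask)
        else:
            masks.append(propagate(key_mask, key_frame, frame))
    return masks
```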
arXiv Detail & Related papers (2023-10-29T09:55:28Z)
- Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), based on semantic-based masked modeling.
Our MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
- RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training (RaP).
We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity.
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
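One plausible reading of the cross-modal minimum-dissimilarity measurement is sketched below: a video patch whose minimum dissimilarity to every text token is large matches nothing in the caption and is therefore a pruning candidate (the symmetric score for text tokens swaps the arguments). The shapes and the pruning interpretation are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_redundancy(patch_emb, token_emb):
    """Per-patch redundancy: minimum dissimilarity (1 - cosine similarity)
    to any text token. High values mean the patch is unmatched by the text.

    patch_emb: (N_p, D) video patch embeddings
    token_emb: (N_t, D) text token embeddings
    """
    p = F.normalize(patch_emb, dim=-1)
    t = F.normalize(token_emb, dim=-1)
    dissim = 1.0 - p @ t.t()              # (N_p, N_t)
    return dissim.min(dim=1).values       # (N_p,)
```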
arXiv Detail & Related papers (2022-10-13T10:11:41Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.