MILES: Visual BERT Pre-training with Injected Language Semantics for
Video-text Retrieval
- URL: http://arxiv.org/abs/2204.12408v1
- Date: Tue, 26 Apr 2022 16:06:31 GMT
- Title: MILES: Visual BERT Pre-training with Injected Language Semantics for
Video-text Retrieval
- Authors: Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying
Shan, Xiaohu Qie and Ping Luo
- Abstract summary: Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols.
- Score: 43.2299969152561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dominant pre-training work for video-text retrieval mainly adopts the
"dual-encoder" architecture to enable efficient retrieval, where two separate
encoders contrast global video and text representations but ignore detailed
local semantics. The recent success of image BERT pre-training with masked
visual modeling, which promotes the learning of local visual context, suggests
a possible solution to this limitation. In this work, we investigate, for the
first time, masked visual modeling in video-text pre-training with the
"dual-encoder" architecture. We perform Masked visual modeling with Injected
LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an
evolving "tokenizer" to produce reconstruction targets for masked video patch
prediction. Given the corrupted video, the video encoder is trained to recover
text-aligned features of the masked patches by reasoning over the visible
regions along the spatial and temporal dimensions, which enhances the
discriminativeness of local visual features and the fine-grained cross-modality
alignment. Our method outperforms state-of-the-art methods for text-to-video
retrieval on four datasets under both zero-shot and fine-tuning evaluation
protocols. Our approach also surpasses the baseline models significantly on
zero-shot action recognition, which can be cast as video-to-text retrieval.
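The core training signal described in the abstract can be pictured with a short sketch. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: an online video encoder predicts, at masked space-time patch positions, the features produced by an exponential-moving-average "snapshot" copy of itself, which plays the role of the evolving "tokenizer". All module names and sizes, the mask ratio, and the cosine-regression loss are assumptions made for illustration.

```python
# Minimal sketch (assumptions throughout) of masked video patch prediction with an
# EMA "snapshot" encoder producing the reconstruction targets.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchVideoEncoder(nn.Module):
    """Toy stand-in for a space-time Transformer over video patch tokens."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                      # tokens: (B, N, dim)
        return self.blocks(tokens)


def masked_feature_prediction_loss(encoder, snapshot, tokens, mask_ratio=0.6):
    """Corrupt a fraction of patch tokens, encode the corrupted video, and regress
    the snapshot encoder's features at the masked positions."""
    B, N, D = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio     # True = masked
    mask_token = tokens.new_zeros(1, 1, D)                         # learned in practice
    corrupted = torch.where(mask.unsqueeze(-1), mask_token, tokens)

    pred = encoder(corrupted)                       # reason from the visible patches
    with torch.no_grad():                           # targets come from the snapshot
        target = snapshot(tokens)

    pred = F.normalize(pred[mask], dim=-1)
    target = F.normalize(target[mask], dim=-1)
    return (2 - 2 * (pred * target).sum(-1)).mean() # cosine regression loss


@torch.no_grad()
def update_snapshot(encoder, snapshot, momentum=0.99):
    """The snapshot 'tokenizer' evolves as an EMA of the online encoder."""
    for p_online, p_snap in zip(encoder.parameters(), snapshot.parameters()):
        p_snap.mul_(momentum).add_(p_online, alpha=1 - momentum)


if __name__ == "__main__":
    encoder = PatchVideoEncoder()
    snapshot = copy.deepcopy(encoder)
    for p in snapshot.parameters():
        p.requires_grad_(False)

    video_tokens = torch.randn(2, 4 * 49, 256)      # (batch, frames * patches, dim)
    loss = masked_feature_prediction_loss(encoder, snapshot, video_tokens)
    loss.backward()
    update_snapshot(encoder, snapshot)
    print(float(loss))
```

In the paper the recovered targets are text-aligned features and the video encoder sits inside a dual-encoder retrieval model; the sketch isolates only the masked-prediction step.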
Related papers
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects fine-grained frame representations with word representations and implicitly distinguishes representations of different instances within a single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach, MaskOCR, to unify vision and language pre-training within the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
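The pattern behind the last entry, transferring an image-text dual encoder to video-text retrieval, can be sketched generically. This is a hedged illustration and not the CLIP2Video architecture itself: toy ImageTower/TextTower modules stand in for a pre-trained CLIP-style model, frames are encoded independently and mean-pooled over time, and a standard symmetric InfoNCE loss is used for end-to-end training. All names, shapes, and hyperparameters below are assumptions.

```python
# Generic sketch (assumptions throughout) of reusing an image-text dual encoder for
# video-text retrieval: per-frame image features, temporal mean pooling, cosine ranking.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageTower(nn.Module):
    """Toy stand-in for the image encoder of a CLIP-style model."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim))

    def forward(self, frames):                  # (B*T, C, H, W) -> (B*T, dim)
        return self.proj(frames)


class TextTower(nn.Module):
    """Toy stand-in for the text encoder of a CLIP-style model."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)

    def forward(self, token_ids):               # (B, L) -> (B, dim)
        return self.embed(token_ids)


def video_text_similarity(image_tower, text_tower, videos, texts):
    """videos: (B, T, C, H, W); texts: (B, L). Returns a (B, B) similarity matrix."""
    B, T = videos.shape[:2]
    frame_emb = image_tower(videos.flatten(0, 1)).view(B, T, -1)
    video_emb = F.normalize(frame_emb.mean(dim=1), dim=-1)   # temporal mean pooling
    text_emb = F.normalize(text_tower(texts), dim=-1)
    return video_emb @ text_emb.t()


def symmetric_infonce(sim, temperature=0.05):
    """Standard dual-encoder contrastive loss over matched (video_i, text_i) pairs."""
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    img, txt = ImageTower(), TextTower()
    videos = torch.randn(4, 8, 3, 32, 32)       # 4 clips, 8 frames each
    texts = torch.randint(0, 1000, (4, 12))     # toy token ids
    sim = video_text_similarity(img, txt, videos, texts)
    print(symmetric_infonce(sim).item())
```

At retrieval time, ranking text candidates for a video amounts to sorting one row of the similarity matrix (and one column for video-to-text retrieval).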