Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval
- URL: http://arxiv.org/abs/2103.15686v1
- Date: Mon, 29 Mar 2021 15:15:09 GMT
- Title: Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval
- Authors: Rui Zhao, Kecheng Zheng, Zheng-Jun Zha, Hongtao Xie and Jiebo Luo
- Abstract summary: Cross-modal video-text retrieval is a challenging task in the field of vision and language.
Existing approaches for this task all focus on how to design the encoding model through a hard negative ranking loss.
We propose a novel memory enhanced embedding learning (MEEL) method for video-text retrieval.
- Score: 155.32369959647437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal video-text retrieval, a challenging task in the field of vision
and language, aims at retrieving the corresponding instance given a sample from
either modality. Existing approaches for this task all focus on how to design
the encoding model through a hard negative ranking loss, leaving two key
problems unaddressed during this procedure. First, in the training stage, only
a mini-batch of instance pairs is available in each iteration, so hard
negatives are mined locally within the mini-batch while global negative samples
across the dataset are ignored. Second, there are many text descriptions for
one video, and each text describes only certain local features of the video;
previous works for this task did not consider fusing the multiple texts
corresponding to a video during training. In this paper, to solve the above two
problems, we propose a novel memory enhanced embedding learning (MEEL) method
for video-text retrieval. Specifically, we construct two kinds of memory banks:
a cross-modal memory module and a text center memory module. The cross-modal
memory module records the instance embeddings of the entire dataset for global
negative mining. To prevent the embeddings in the memory bank from evolving too
quickly during training, we use a momentum encoder that updates the features
with a moving-average strategy. The text center memory module records the
center of the multiple textual instances corresponding to a video and aims at
bridging these textual instances together. Extensive experimental results on
two challenging benchmarks, i.e., MSR-VTT and VATEX, demonstrate the
effectiveness of the proposed method.
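The abstract describes the two memory modules only at a high level, so the snippet below is a minimal PyTorch sketch of one plausible realization rather than the authors' released implementation: the class names (`CrossModalMemory`, `TextCenterMemory`), the queue size, the EMA coefficients, and the exact loss forms are illustrative assumptions.

```python
# Hedged sketch of MEEL's two memory banks: a momentum-updated cross-modal
# memory for global negative mining, and a text center memory that pulls the
# multiple captions of one video toward a shared center. All hyperparameters
# and names below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMemory(nn.Module):
    """FIFO queue of momentum-encoded embeddings used for global negative mining."""

    def __init__(self, dim=256, queue_size=4096, momentum=0.999):
        super().__init__()
        self.momentum = momentum
        # Memory bank holding L2-normalized embeddings drawn from the whole dataset.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def momentum_update(self, encoder_q, encoder_k):
        # Moving-average (EMA) update of the momentum encoder, so the embeddings
        # written into the bank drift slowly during training.
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(self.momentum).add_(p_q.data, alpha=1.0 - self.momentum)

    @torch.no_grad()
    def enqueue(self, keys):
        # Overwrite the oldest slots with the newest momentum-encoded batch.
        keys = F.normalize(keys, dim=1)
        ptr = int(self.ptr)
        idx = torch.arange(ptr, ptr + keys.shape[0]) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr[0] = (ptr + keys.shape[0]) % self.queue.shape[0]

    def global_negative_loss(self, queries, pos_sims, margin=0.2):
        # Max-margin ranking loss whose hardest negative is mined from the
        # entire memory bank rather than only from the current mini-batch.
        queries = F.normalize(queries, dim=1)
        neg_sims = queries @ self.queue.t()           # (B, queue_size)
        hardest_neg = neg_sims.max(dim=1).values      # hardest global negative per query
        return F.relu(margin + hardest_neg - pos_sims).mean()


class TextCenterMemory(nn.Module):
    """Running center of the multiple caption embeddings attached to each video."""

    def __init__(self, num_videos, dim=256, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.register_buffer("centers", torch.zeros(num_videos, dim))

    @torch.no_grad()
    def update(self, video_ids, text_embs):
        # Exponential moving average pulls each stored center toward the
        # newly encoded captions of the same video.
        text_embs = F.normalize(text_embs, dim=1)
        self.centers[video_ids] = (
            self.alpha * self.centers[video_ids] + (1.0 - self.alpha) * text_embs
        )

    def center_loss(self, video_ids, text_embs):
        # Pull every caption embedding toward its video's text center so the
        # multiple descriptions of one video are bridged together.
        centers = F.normalize(self.centers[video_ids], dim=1)
        text_embs = F.normalize(text_embs, dim=1)
        return (1.0 - F.cosine_similarity(text_embs, centers, dim=1)).mean()
```

In a training loop one would presumably encode each batch with both the query and momentum encoders, call `momentum_update` and `enqueue` on the momentum-encoded features of the opposite modality, and add `global_negative_loss` and `center_loss` to the usual in-batch hard negative ranking loss; how the paper actually weights and combines these terms is not specified in the abstract.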
Related papers
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval [9.899703354116962]
Dense video captioning aims to automatically localize and caption all events within an untrimmed video.
We propose a novel framework inspired by the cognitive information processing of humans.
Our model utilizes external memory to incorporate prior knowledge.
arXiv Detail & Related papers (2024-04-11T09:58:23Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, which uses only text queries together with uncurated web videos during training.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement for video patches and text tokens by calculating the cross-modal minimum dissimilarity.
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
arXiv Detail & Related papers (2022-10-13T10:11:41Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module named BridgeFormer is trained to answer the "questions" constructed from the text features by resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.