Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial
Margin Contrastive Learning
- URL: http://arxiv.org/abs/2309.11082v3
- Date: Fri, 26 Jan 2024 06:18:59 GMT
- Title: Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial
Margin Contrastive Learning
- Authors: Chen Jiang, Hong Liu, Xuzheng Yu, Qing Wang, Yuan Cheng, Jia Xu,
Zhongyi Liu, Qingpei Guo, Wei Chu, Ming Yang, Yuan Qi
- Abstract summary: Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
- Score: 35.404100473539195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the explosion of web videos makes text-video retrieval
increasingly essential and popular for video filtering, recommendation, and
search. Text-video retrieval aims to rank relevant text/video higher than
irrelevant ones. The core of this task is to precisely measure the cross-modal
similarity between texts and videos. Recently, contrastive learning methods
have shown promising results for text-video retrieval, most of which focus on
the construction of positive and negative pairs to learn text and video
representations. Nevertheless, they do not pay enough attention to hard
negative pairs and lack the ability to model different levels of semantic
similarity. To address these two issues, this paper improves contrastive
learning using two novel techniques. First, to exploit hard examples for robust
discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module
(DMAE) to mine hard negative pairs from textual and visual clues. By further
introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively
identify all these hard negatives and explicitly highlight their impacts in the
training loss. Second, our work argues that triplet samples can better model
fine-grained semantic similarity compared to pairwise samples. We thereby
present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to
construct partial order triplet samples by automatically generating
fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL
designs an adaptive token masking strategy with cross-modal interaction to
model subtle semantic differences. Extensive experiments demonstrate that the
proposed approach outperforms existing methods on four widely-used text-video
retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
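As a rough illustration of the two ideas described above (a contrastive objective that explicitly penalises mined hard negatives, and a triplet margin over generated fine-grained negatives), the PyTorch sketch below may help. The function names, the `neg_weight` term, and the simplified hard-negative penalty are assumptions for illustration only; the paper's actual DMAE attention-based mining and TPM-CL adaptive token masking are not reproduced here.
```python
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(text_emb, video_emb, hard_mask,
                                 temperature=0.05, neg_weight=0.5):
    """Symmetric InfoNCE over a batch of text/video embeddings, plus an
    explicit penalty on mined hard negatives (in the spirit of NegNCE).

    text_emb, video_emb: (B, D) L2-normalised embeddings; diagonal pairs match.
    hard_mask: (B, B) boolean, True where a non-matching pair was flagged as a
               hard negative by some mining step (e.g. attention-based, as in
               DMAE); the mining itself is not reproduced here.
    """
    sim = text_emb @ video_emb.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)

    # Standard symmetric contrastive terms (text-to-video and video-to-text).
    loss_t2v = F.cross_entropy(sim, labels)
    loss_v2t = F.cross_entropy(sim.t(), labels)

    # Extra term that explicitly pushes flagged hard negatives below the
    # matched pair's similarity (a simplified stand-in for NegNCE).
    pos = sim.diag().unsqueeze(1)                         # (B, 1)
    hard_term = (F.relu(sim - pos)[hard_mask].mean()
                 if hard_mask.any() else sim.new_zeros(()))

    return 0.5 * (loss_t2v + loss_v2t) + neg_weight * hard_term


def triplet_partial_margin(anchor, positive, hard_negative, margin=0.2):
    """Triplet loss over (text anchor, matched video, generated hard negative),
    illustrating the partial-order idea behind TPM-CL: the hard negative should
    stay at least `margin` less similar to the anchor than the true match."""
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, hard_negative)
    return F.relu(neg_sim - pos_sim + margin).mean()
```
In this sketch the two terms are simply summed with a fixed weight; how the paper actually balances its losses is not stated in the abstract.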
Related papers
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for
Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, in which training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z) - Improving Video Retrieval by Adaptive Margin [18.326296132847332]
The dominant paradigm for video retrieval learns video-text representations by pushing the similarity of positive pairs above that of negative pairs by a fixed margin.
Negative pairs used for training are sampled randomly, so the text and video in a negative pair may in fact be semantically related or even equivalent.
We propose an adaptive margin that varies with the distance between positive and negative pairs to address this issue.
arXiv Detail & Related papers (2023-03-09T08:07:38Z) - Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z) - Video-aided Unsupervised Grammar Induction [108.53765268059425]
We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video.
Video provides even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases.
We propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities.
arXiv Detail & Related papers (2021-04-09T14:01:36Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)