X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
Retrieval
- URL: http://arxiv.org/abs/2207.07285v1
- Date: Fri, 15 Jul 2022 04:23:42 GMT
- Title: X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
Retrieval
- Authors: Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
- Abstract summary: Cross-grained contrast is the contrast between coarse-grained representations and fine-grained representations.
X-CLIP is a novel multi-grained contrastive model for video-text retrieval.
X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets.
- Score: 87.3821932795969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-text retrieval has been a crucial and fundamental task in multi-modal
research. The development of video-text retrieval has been considerably
promoted by large-scale multi-modal contrastive pre-training, which primarily
focuses on coarse-grained or fine-grained contrast. However, cross-grained
contrast, which is the contrast between coarse-grained representations and
fine-grained representations, has rarely been explored in prior research.
Compared with fine-grained or coarse-grained contrasts, cross-grained contrast
calculates the correlation between the coarse-grained feature and each
fine-grained feature, and can filter out fine-grained features that are
irrelevant to the coarse-grained feature during similarity calculation, thus
improving retrieval accuracy. To this end, this paper presents a novel
multi-grained
contrastive model, namely X-CLIP, for video-text retrieval. However, another
challenge lies in the similarity aggregation problem, which aims to aggregate
fine-grained and cross-grained similarity matrices to instance-level
similarity. To address this challenge, we propose the Attention Over Similarity
Matrix (AOSM) module to make the model focus on the contrast between essential
frames and words, thus lowering the impact of unnecessary frames and words on
retrieval results. With multi-grained contrast and the proposed AOSM module,
X-CLIP achieves outstanding performance on five widely-used video-text
retrieval datasets, including MSR-VTT (49.3 R@1), MSVD (50.4 R@1), LSMDC (26.1
R@1), DiDeMo (47.8 R@1) and ActivityNet (46.2 R@1). It outperforms the previous
state-of-the-art by +6.3%, +6.6%, +11.1%, +6.7%, and +3.8% relative improvements on
these benchmarks, demonstrating the superiority of multi-grained contrast and
AOSM.
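To make the multi-grained contrast and the AOSM module concrete, the PyTorch sketch below scores a single video-text pair. It is not the authors' released implementation: the function names, the temperature value, the pooling order, and the equal-weight fusion of the four similarity terms are illustrative assumptions. What it shows is the core idea that fine-grained and cross-grained similarities are aggregated with a softmax attention over the similarity scores themselves, so frames and words that match poorly contribute little to the instance-level score.

```python
import torch
import torch.nn.functional as F

def aosm_pool(sim, tau=0.01, dim=-1):
    """Attention-over-similarity pooling along one axis: softmax(sim / tau)
    re-weights entries so that the most similar (i.e. essential) frames or
    words dominate the aggregated score."""
    attn = F.softmax(sim / tau, dim=dim)
    return (attn * sim).sum(dim=dim)

def multi_grained_score(v, f, s, w, tau=0.01):
    """Instance-level similarity for one video-text pair.
    v: (d,)    coarse video embedding     s: (d,)    coarse sentence embedding
    f: (Nf, d) frame embeddings           w: (Nw, d) word embeddings
    All embeddings are assumed to be L2-normalized."""
    s_vs = v @ s                           # coarse-grained: video vs. sentence
    s_vw = aosm_pool(w @ v, tau)           # cross-grained: video vs. each word
    s_sf = aosm_pool(f @ s, tau)           # cross-grained: sentence vs. each frame
    fw = f @ w.T                           # fine-grained similarity matrix (Nf, Nw)
    s_fw = aosm_pool(aosm_pool(fw, tau, dim=1), tau, dim=0)  # attend over words, then frames
    return (s_vs + s_vw + s_sf + s_fw) / 4  # equal-weight fusion is an assumption here

# Toy usage with random, normalized features.
d, Nf, Nw = 512, 12, 20
v = F.normalize(torch.randn(d), dim=0)
s = F.normalize(torch.randn(d), dim=0)
f = F.normalize(torch.randn(Nf, d), dim=-1)
w = F.normalize(torch.randn(Nw, d), dim=-1)
print(multi_grained_score(v, f, s, w))
```

In practice the same computation would be batched over every video-text pair in a mini-batch, and the resulting instance-level similarity matrix fed to a symmetric (video-to-text and text-to-video) contrastive loss.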
Related papers
- TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval [1.8434042562191815]
We propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC.
Our model employs a language-video attention block to generate aggregated frame and video representations conditioned on word-level and sentence-level attention weights over the frames.
Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks.
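As a rough illustration of the text-conditioned aggregation described above, the sketch below (hypothetical names and shapes, not the TC-MGC code) computes attention weights from the text and uses them to pool frame features, so the same video is summarized differently for different queries.

```python
import torch
import torch.nn.functional as F

def text_conditioned_video(frames, sentence, tau=0.07):
    """Pool frames into one video vector using the sentence's attention
    weights over the frames. frames: (Nf, d), sentence: (d,)."""
    attn = F.softmax(frames @ sentence / tau, dim=0)   # (Nf,) attention over frames
    return attn @ frames                               # (d,) text-conditioned video vector

def word_conditioned_frames(frames, words, tau=0.07):
    """One aggregated frame vector per word, weighted by that word's
    attention over the frames. frames: (Nf, d), words: (Nw, d)."""
    attn = F.softmax(words @ frames.T / tau, dim=-1)   # (Nw, Nf) per-word attention
    return attn @ frames                               # (Nw, d) word-conditioned frame vectors
```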
arXiv Detail & Related papers (2025-04-07T03:33:14Z)
- Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval [66.61856014573742]
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description.
Previous methods have attempted to align text and image samples in a modal-shared space.
We propose an effective bi-directional one-to-many embedding paradigm that offers a clear optimization direction for each sample.
arXiv Detail & Related papers (2024-06-09T03:06:55Z)
- CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Re-Identification [10.64115914599574]
Weakly supervised text-based person re-identification (TPRe-ID) seeks to retrieve images of a target person using textual descriptions.
The primary challenge is the intra-class differences, encompassing intra-modal feature variations and cross-modal semantic gaps.
In practice, CPCL introduces the CLIP model into weakly supervised TPRe-ID for the first time, mapping visual and textual instances into a shared latent space.
arXiv Detail & Related papers (2024-01-18T14:27:01Z)
- Unified Coarse-to-Fine Alignment for Video-Text Retrieval [71.85966033484597]
We propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Our model captures the cross-modal similarity information at different granularity levels.
We apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them.
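For reference, here is a generic PyTorch sketch of Sinkhorn-Knopp normalization; the iteration count and the exact way UCoFiA applies it to each similarity level are assumptions, and only the algorithm itself is shown. Alternating row and column rescaling drives a non-negative matrix toward being doubly stochastic, so no single query or candidate dominates the summed similarity.

```python
import torch

def sinkhorn_knopp(sim, n_iters=10, eps=1e-8):
    """Alternately rescale rows and columns of a non-negative matrix until it
    is approximately doubly stochastic. sim: (N, N) similarity matrix with
    non-negative entries (e.g. exponentiated cosine similarities)."""
    mat = sim.clone()
    for _ in range(n_iters):
        mat = mat / (mat.sum(dim=1, keepdim=True) + eps)  # normalize rows
        mat = mat / (mat.sum(dim=0, keepdim=True) + eps)  # normalize columns
    return mat

# Columns sum to ~1 after a few iterations; rows are close to 1 as well.
print(sinkhorn_knopp(torch.rand(4, 4)).sum(dim=0))
```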
arXiv Detail & Related papers (2023-09-18T19:04:37Z)
- Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)
- A Similarity Inference Metric for RGB-Infrared Cross-Modality Person Re-identification [66.49212581685127]
Cross-modality person re-identification (re-ID) is a challenging task due to the large discrepancy between IR and RGB modalities.
Existing methods address this challenge typically by aligning feature distributions or image styles across modalities.
This paper presents a novel similarity inference metric (SIM) that exploits the intra-modality sample similarities to circumvent the cross-modality discrepancy.
arXiv Detail & Related papers (2020-07-03T05:28:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.