Unified Coarse-to-Fine Alignment for Video-Text Retrieval
- URL: http://arxiv.org/abs/2309.10091v1
- Date: Mon, 18 Sep 2023 19:04:37 GMT
- Title: Unified Coarse-to-Fine Alignment for Video-Text Retrieval
- Authors: Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
- Abstract summary: We propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Our model captures the cross-modal similarity information at different granularity levels.
We apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The canonical approach to video-text retrieval leverages a coarse-grained or
fine-grained alignment between visual and textual information. However,
retrieving the correct video according to the text query is often challenging
as it requires the ability to reason about both high-level (scene) and
low-level (object) visual clues and how they relate to the text query. To this
end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Specifically, our model captures the cross-modal similarity information at
different granularity levels. To alleviate the effect of irrelevant visual
clues, we also apply an Interactive Similarity Aggregation module (ISA) to
consider the importance of different visual features while aggregating the
cross-modal similarity to obtain a similarity score for each granularity.
Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of
each level before summing them, alleviating over- and under-representation
issues at different levels. By jointly considering the cross-modal similarity of
different granularity, UCoFiA allows the effective unification of multi-grained
alignments. Empirically, UCoFiA outperforms previous state-of-the-art
CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%,
1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT,
ActivityNet, and DiDeMo, respectively. Our code is publicly available at
https://github.com/Ziyang412/UCoFiA.
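To make the normalization step in the abstract concrete, below is a minimal sketch (not the authors' released implementation) of applying Sinkhorn-Knopp normalization to per-level text-video similarity matrices and summing them into the final matrix used for ranking. The matrix shapes, temperature, iteration count, and the choice of granularity levels are illustrative assumptions.

```python
# Sketch: Sinkhorn-Knopp normalization of per-level similarity matrices,
# followed by summation across granularity levels (assumed setup, not UCoFiA's code).
import numpy as np

def sinkhorn_knopp(sim: np.ndarray, n_iters: int = 4, temperature: float = 0.05) -> np.ndarray:
    """Approximately normalize a similarity matrix so that rows and columns
    each sum to 1, alleviating over-/under-represented texts or videos."""
    mat = np.exp(sim / temperature)  # turn similarities into positive weights
    for _ in range(n_iters):
        mat = mat / mat.sum(axis=1, keepdims=True)  # normalize rows (text queries)
        mat = mat / mat.sum(axis=0, keepdims=True)  # normalize columns (videos)
    return mat

# Hypothetical per-level similarities for 8 text queries vs. 10 videos; the levels
# stand in for whatever granularities the model computes (e.g., scene-, frame-, object-level).
rng = np.random.default_rng(0)
levels = [rng.normal(size=(8, 10)) for _ in range(3)]

# Normalize each level independently, then sum to unify the multi-grained alignments.
final_score = sum(sinkhorn_knopp(s) for s in levels)
best_video_per_text = final_score.argmax(axis=1)  # text-to-video retrieval
```

In this toy version, each level is normalized on its own before the sum, so no single granularity can dominate the aggregated score simply because its raw similarities have a larger scale.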
Related papers
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection.
In the second stage, we propose a novel decoupled video-text cross-attention module to capture fine-grained multimodal information in the spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in a sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video (a minimal text-conditioned pooling sketch follows this list).
arXiv Detail & Related papers (2022-03-28T20:47:37Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z) - Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search for all text instances in an image gallery that are the same as or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z) - Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using
Transformer Encoders [14.634046503477979]
We present a novel approach called Transformer Reasoning and Alignment Network (TERAN).
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the image and sentence retrieval tasks, respectively.
arXiv Detail & Related papers (2020-08-12T11:02:40Z) - Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
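As a companion to the X-Pool entry above, the following sketch illustrates the general idea of text-conditioned attention pooling over frame embeddings: the text embedding attends over per-frame embeddings, and the attention-weighted frame average is scored against the text. It is an illustrative toy version under assumed shapes and scaling, not the paper's architecture.

```python
# Sketch: text-conditioned pooling of frame embeddings (assumed setup, not X-Pool's exact model).
import torch
import torch.nn.functional as F

def text_conditioned_pool(text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
    """text_emb: (B, D) text embeddings; frame_embs: (B, T, D) frame embeddings.
    Returns a (B,) similarity score per text-video pair."""
    # Attention logits: how relevant each frame is to the query text.
    logits = torch.einsum("bd,btd->bt", text_emb, frame_embs) / frame_embs.shape[-1] ** 0.5
    weights = logits.softmax(dim=-1)                           # (B, T) frame weights
    pooled = torch.einsum("bt,btd->bd", weights, frame_embs)   # weighted frame average
    # Cosine similarity between the text and the text-conditioned video embedding.
    return F.cosine_similarity(text_emb, pooled, dim=-1)

# Example with hypothetical CLIP-sized embeddings: 4 pairs, 12 frames each, dim 512.
text = torch.randn(4, 512)
frames = torch.randn(4, 12, 512)
scores = text_conditioned_pool(text, frames)  # shape (4,)
```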