UATVR: Uncertainty-Adaptive Text-Video Retrieval
- URL: http://arxiv.org/abs/2301.06309v2
- Date: Sat, 19 Aug 2023 02:28:10 GMT
- Title: UATVR: Uncertainty-Adaptive Text-Video Retrieval
- Authors: Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang,
Xiangbo Shu, Xiangyang Ji, Jingdong Wang
- Abstract summary: A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
- Score: 90.8952122146241
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: With the explosive growth of web videos and emerging large-scale
vision-language pre-training models, e.g., CLIP, retrieving videos of interest
with text instructions has attracted increasing attention. A common practice is
to transfer text-video pairs to the same embedding space and craft cross-modal
interactions with certain entities in specific granularities for semantic
correspondence. Unfortunately, the intrinsic uncertainties of optimal entity
combinations in appropriate granularities for cross-modal queries are
understudied, which is especially critical for modalities with hierarchical
semantics, e.g., video, text, etc. In this paper, we propose an
Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models
each look-up as a distribution matching procedure. Concretely, we add
additional learnable tokens in the encoders to adaptively aggregate
multi-grained semantics for flexible high-level reasoning. In the refined
embedding space, we represent text-video pairs as probabilistic distributions
where prototypes are sampled for matching evaluation. Comprehensive experiments
on four benchmarks justify the superiority of our UATVR, which achieves new
state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and
DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR.
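The abstract sketches two ideas: extra learnable tokens that let each encoder aggregate multi-grained semantics, and a matching rule that treats each text and video as a probabilistic distribution from which prototypes are sampled for scoring. A minimal PyTorch-style sketch of how such a matcher could be wired together is shown below; the module name, hyperparameters, mean-pooling, and the simple averaged-cosine scoring rule are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
# Illustrative sketch only: distribution-based text-video matching in the
# spirit of UATVR. All names and choices here are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributionMatcher(nn.Module):
    def __init__(self, dim=512, num_extra_tokens=4, num_samples=7):
        super().__init__()
        # Learnable tokens appended to each modality's token sequence so the
        # encoder output can aggregate multi-grained semantics.
        self.text_extra = nn.Parameter(torch.randn(num_extra_tokens, dim) * 0.02)
        self.video_extra = nn.Parameter(torch.randn(num_extra_tokens, dim) * 0.02)
        # Heads mapping pooled features to the mean and log-variance of a
        # diagonal Gaussian in the joint embedding space.
        self.mu_head = nn.Linear(dim, dim)
        self.logvar_head = nn.Linear(dim, dim)
        self.num_samples = num_samples

    def as_distribution(self, tokens):
        # tokens: (batch, seq, dim) features from a text or video encoder.
        pooled = tokens.mean(dim=1)
        return self.mu_head(pooled), self.logvar_head(pooled)

    def sample_prototypes(self, mu, logvar):
        # Reparameterized samples ("prototypes") from each Gaussian.
        std = (0.5 * logvar).exp()
        eps = torch.randn(self.num_samples, *mu.shape, device=mu.device)
        return mu.unsqueeze(0) + eps * std.unsqueeze(0)   # (K, batch, dim)

    def forward(self, text_tokens, video_tokens):
        # Append the learnable tokens before pooling.
        t = torch.cat([text_tokens,
                       self.text_extra.unsqueeze(0).expand(text_tokens.size(0), -1, -1)], dim=1)
        v = torch.cat([video_tokens,
                       self.video_extra.unsqueeze(0).expand(video_tokens.size(0), -1, -1)], dim=1)
        t_mu, t_logvar = self.as_distribution(t)
        v_mu, v_logvar = self.as_distribution(v)
        t_samples = F.normalize(self.sample_prototypes(t_mu, t_logvar), dim=-1)
        v_samples = F.normalize(self.sample_prototypes(v_mu, v_logvar), dim=-1)
        # Score every text against every video: average cosine similarity over
        # all sampled prototype pairs (one simple matching rule among several).
        sim = torch.einsum('kbd,lcd->klbc', t_samples, v_samples).mean(dim=(0, 1))
        return sim  # (num_texts, num_videos) matrix used for retrieval ranking

# Example with random features standing in for encoder outputs:
# matcher = DistributionMatcher()
# scores = matcher(torch.randn(8, 32, 512), torch.randn(8, 12, 512))  # (8, 8)
```

In a real system the token features would come from a CLIP-style text and video encoder and the score matrix would feed a contrastive retrieval loss; this sketch only illustrates the prototype-sampling and matching step.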
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed for static images.
We propose to recognize human attributes from video frames so that temporal information can be fully exploited.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - ProTA: Probabilistic Token Aggregation for Text-Video Retrieval [15.891020334480826]
We propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry.
ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%)
arXiv Detail & Related papers (2024-04-18T14:20:30Z) - Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z) - Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial
Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z) - Expectation-Maximization Contrastive Learning for Compact
Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space (see the sketch after this list).
Experiments on three benchmark text-video retrieval datasets show that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT)
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
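For the EMCL entry above, the following is a rough sketch of what an EM-style search for a compact set of latent-space bases can look like. The function em_bases, its hyperparameters, and the softmax responsibility rule are assumptions chosen for illustration and are not taken from the EMCL code.

```python
# Illustrative sketch only: EM-style estimation of a compact basis set for a
# batch of embeddings, in the spirit of the EMCL summary above.
import torch
import torch.nn.functional as F

def em_bases(features, num_bases=32, num_iters=3, tau=0.05):
    """features: (N, D) L2-normalized embeddings. Returns (num_bases, D) bases
    and (N, D) features re-expressed through the compact basis set."""
    n, _ = features.shape
    # Initialize the bases from randomly chosen feature vectors.
    idx = torch.randperm(n)[:num_bases]
    bases = features[idx].clone()
    for _ in range(num_iters):
        # E-step: soft-assign each feature to the current bases.
        resp = F.softmax(features @ bases.t() / tau, dim=1)   # (N, K)
        # M-step: re-estimate each basis as the responsibility-weighted mean.
        bases = F.normalize(resp.t() @ features, dim=1)        # (K, D)
    # Reconstruct every feature as a combination of the learned bases.
    resp = F.softmax(features @ bases.t() / tau, dim=1)
    reconstructed = F.normalize(resp @ bases, dim=1)            # (N, D)
    return bases, reconstructed

# Example: compress 1000 clip-level embeddings of dimension 512 into 32 bases.
x = F.normalize(torch.randn(1000, 512), dim=1)
bases, x_compact = em_bases(x)
```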