ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
- URL: http://arxiv.org/abs/2404.12216v2
- Date: Sat, 20 Apr 2024 04:33:08 GMT
- Title: ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
- Authors: Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun,
- Abstract summary: We propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry.
ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%)
- Score: 15.891020334480826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
Related papers
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-language Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - Expectation-Maximization Contrastive Learning for Compact
Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval require models to understand information from different channels.
contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR)
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z) - Probabilistic Embeddings for Cross-Modal Retrieval [38.04859099157609]
Cross-modal retrieval methods build a common representation space for samples from multiple modalities.
In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences.
Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space.
arXiv Detail & Related papers (2021-01-13T13:58:00Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT)
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.