Related papers: ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

URL: http://arxiv.org/abs/2404.12216v2
Date: Sat, 20 Apr 2024 04:33:08 GMT
Title: ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Authors: Han Fang, Xianghao Zang, Chao Ban, Zerun Feng, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun,
Abstract summary: We propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%)
Score: 15.891020334480826
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, the model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representation and maintain the feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).

Related papers

GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval [13.928213494843744]
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples.<n>Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data.<n>We introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions.
arXiv Detail & Related papers (2025-05-19T16:25:55Z)
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval [1.8434042562191815]
We propose a Text-Conditioned Multi-Grained Contrast framework, dubbed TC-MGC. Our model employs a language-video attention block to generate aggregated frame and video representations conditioned on the word's and text's attention weights over frames. Empirically, TC-MGC achieves competitive results on multiple text-video retrieval benchmarks.
arXiv Detail & Related papers (2025-04-07T03:33:14Z)
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models. We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities. We propose an Uncertainty-language Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space. Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval require models to understand information from different channels. contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text. There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR) We study the interaction paradigm in depth, where we find that its computation can be split into two terms. We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)
Probabilistic Embeddings for Cross-Modal Retrieval [38.04859099157609]
Cross-modal retrieval methods build a common representation space for samples from multiple modalities. In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences. Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space.
arXiv Detail & Related papers (2021-01-13T13:58:00Z)
Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework. We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT)
arXiv Detail & Related papers (2020-06-12T14:07:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.