GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval
- URL: http://arxiv.org/abs/2405.13824v1
- Date: Wed, 22 May 2024 16:55:31 GMT
- Title: GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval
- Authors: Yuting Wang, Jinpeng Wang, Bin Chen, Tao Dai, Ruisheng Luo, Shu-Tao Xia
- Abstract summary: We present GMMFormer v2, an uncertainty-aware framework for PRVR.
For clip modeling, we improve the strong GMMFormer baseline with a novel temporal consolidation module.
We propose a novel optimal matching loss for fine-grained text-clip alignment.
- Score: 60.70901959953688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments. Due to the lack of moment annotations, the uncertainty in clip modeling and in text-clip correspondence poses major challenges. Despite great progress, existing solutions sacrifice either efficiency or efficacy to capture varying and uncertain video moments. Worse still, few methods have paid attention to the text-clip matching pattern under such uncertainty, exposing a risk of semantic collapse. To address these issues, we present GMMFormer v2, an uncertainty-aware framework for PRVR. For clip modeling, we improve the strong GMMFormer baseline with a novel temporal consolidation module built upon multi-scale contextual features, which maintains efficiency and improves the perception of varying moments. To achieve uncertainty-aware text-clip matching, we upgrade the query diverse loss in GMMFormer to facilitate fine-grained uniformity and propose a novel optimal matching loss for fine-grained text-clip alignment. Their collaboration alleviates the semantic collapse phenomenon and promotes accurate correspondence between texts and moments. We conduct extensive experiments and ablation studies on three PRVR benchmarks, demonstrating the remarkable improvement of GMMFormer v2 over the previous SOTA competitor and the versatility of uncertainty-aware text-clip matching for PRVR. Code is available at \url{https://github.com/huangmozhi9527/GMMFormer_v2}.
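To make the uncertainty-aware text-clip matching concrete, here is a minimal, hypothetical sketch of an optimal-matching style alignment loss in PyTorch. It assumes one video paired with several annotated texts and several clip-level embeddings, and uses a Hungarian assignment between the two sets; the function name, tensor shapes, and the exact loss form are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of an optimal-matching text-clip alignment loss.
# Shapes and the loss form are assumptions; see the official repository
# (https://github.com/huangmozhi9527/GMMFormer_v2) for the actual method.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def optimal_matching_loss(text_emb: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
    """text_emb: (T, D) embeddings of the texts paired with one video.
    clip_emb: (C, D) clip-level embeddings of that video, with C >= T."""
    # Cosine similarity between every text and every candidate clip.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(clip_emb, dim=-1).t()  # (T, C)

    # One-to-one assignment that maximizes total similarity (Hungarian algorithm);
    # the assignment is treated as a constant, gradients flow through `sim` only.
    rows, cols = linear_sum_assignment(-sim.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)

    # Pull each text toward its optimally matched clip.
    matched = sim[rows, cols]
    return (1.0 - matched).mean()
```

In practice such a loss would be computed per video and averaged over the batch, alongside the upgraded query diverse loss mentioned in the abstract, which this sketch does not cover.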
Related papers
- VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding [44.382937324454254]
Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding.
We propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus.
With VERIFIED, we construct this fine-grained benchmark, which comprises Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG.
arXiv Detail & Related papers (2024-10-11T07:42:36Z)
- Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval [58.17315970207874]
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
arXiv Detail & Related papers (2024-01-24T09:45:40Z)
- Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval [31.42856682276394]
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query.
Existing strategies are often sub-optimal since they ignore the modality imbalance problem.
We introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment.
arXiv Detail & Related papers (2023-12-19T13:38:48Z)
- GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval [59.47258928867802]
Given a text query, partially relevant video retrieval (PRVR) seeks to find videos containing pertinent moments in a database.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer; a sketch of its Gaussian-constrained attention idea follows this entry.
arXiv Detail & Related papers (2023-10-08T15:04:50Z)
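As a rough illustration of the Gaussian-constrained attention idea behind GMMFormer (and inherited by GMMFormer v2 above), here is a minimal PyTorch sketch in which frame-to-frame attention logits are biased by frame-centered Gaussian windows of several widths, producing multi-scale contextual features without enumerating explicit clip proposals. The sigma values, shapes, and the absence of projections or multi-head structure are simplifications for exposition, not the released implementation.

```python
# Illustrative sketch of Gaussian-constrained, multi-scale frame attention.
# Sigma values and shapes are assumptions chosen for exposition only.
import torch
import torch.nn.functional as F


def gaussian_bias(num_frames: int, sigma: float) -> torch.Tensor:
    """(N, N) additive attention bias -(i - j)^2 / (2 * sigma^2),
    so frame i mostly attends to a neighbourhood of width ~sigma."""
    idx = torch.arange(num_frames, dtype=torch.float32)
    return -((idx[None, :] - idx[:, None]) ** 2) / (2.0 * sigma ** 2)


def multi_scale_gaussian_attention(frames: torch.Tensor,
                                   sigmas=(1.0, 4.0, 16.0)) -> torch.Tensor:
    """frames: (N, D) frame features of one untrimmed video.
    Returns (len(sigmas), N, D) contextual features, one scale per sigma."""
    n, d = frames.shape
    logits = frames @ frames.t() / d ** 0.5        # plain dot-product attention logits
    outputs = []
    for sigma in sigmas:
        attn = F.softmax(logits + gaussian_bias(n, sigma), dim=-1)
        outputs.append(attn @ frames)              # per-scale context aggregation
    return torch.stack(outputs)
```

According to the GMMFormer v2 abstract, the per-scale features would then be fused by a temporal consolidation module; its exact form is not specified in the abstract and is therefore omitted here.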
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.