Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
- URL: http://arxiv.org/abs/2508.14812v1
- Date: Wed, 20 Aug 2025 16:03:56 GMT
- Title: Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
- Authors: Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: Existing methods rely on large-scale pre-training to improve video retrieval performance. We propose a novel framework to learn fine-grained features for better alignment. We also introduce an inference pipeline to improve performance without additional training.
- Score: 93.31112073070906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as "Repetition", can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
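The abstract describes but does not formalize the inference pipeline, so the following is a minimal sketch of one plausible reading, not the authors' implementation: caption variants are formed by repeating keywords ("Repetition"), each variant's similarity row over all videos is computed by some pretrained dual encoder (not shown), and the rows are combined by a vote weighted by an entropy-based confidence standing in for the paper's Matching Entropy. The helpers `repeat_keywords` and `matching_entropy`, and the inverse-entropy weighting, are illustrative assumptions.

```python
import numpy as np

def matching_entropy(sims: np.ndarray, temperature: float = 0.01) -> float:
    """Entropy of the softmax over one caption's video similarities.

    A low entropy means the caption singles out one video decisively, so we
    treat it as a confident vote. This is one plausible reading of the paper's
    "Matching Entropy"; the official definition may differ.
    """
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def repeat_keywords(caption: str, keywords: list[str], times: int = 2) -> str:
    """Naively append repeated keywords to form a "Repetition" caption variant."""
    return caption + " " + " ".join(kw for kw in keywords for _ in range(times))

def retrieve_with_voting(sim_rows: list[np.ndarray]) -> int:
    """Combine the similarity rows of all caption variants with an
    entropy-weighted vote and return the winning video index."""
    votes = np.zeros_like(sim_rows[0])
    for sims in sim_rows:
        weight = 1.0 / (matching_entropy(sims) + 1e-6)  # confident rows count more
        votes += weight * sims
    return int(votes.argmax())

# Toy usage: two caption variants (original + keyword-repeated) scored
# against four videos.
print(repeat_keywords("a man cooks pasta", ["pasta"]))   # "a man cooks pasta pasta pasta"
rows = [np.array([0.1, 0.8, 0.2, 0.3]), np.array([0.2, 0.7, 0.6, 0.1])]
print(retrieve_with_voting(rows))                        # -> 1
```

The intuition behind the weighting is that a low-entropy similarity row picks out one video decisively, so its vote should count more, while variants that produce diffuse rows are down-weighted.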
Related papers
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval [99.33724613432922]
We introduce RANKVIDEO, a reasoning-based reranker for video retrieval. RANKVIDEO explicitly reasons over query-video pairs using video content to assess relevance. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework.
arXiv Detail & Related papers (2026-02-02T18:40:37Z)
- Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric [1.9774761182870912]
We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. We conduct experiments on the YouCook2 benchmark, showing promising retrieval performance.
arXiv Detail & Related papers (2025-04-06T18:18:09Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts use synthetic annotations to refine large-scale, diverse ASR datasets that are compromised by low quality.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text. This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. A minimal sketch of one such selection step follows this entry.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
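The summary above does not specify the query selection mechanism, so the sketch below uses a maximal-marginal-relevance (MMR) style greedy selector, a common way to trade relevance against diversity, purely as a stand-in. The inputs `rel` (per-query relevance), `sim` (pairwise query similarity), and the trade-off weight `lam` are illustrative assumptions.

```python
import numpy as np

def select_queries(rel: np.ndarray, sim: np.ndarray, k: int, lam: float = 0.5) -> list[int]:
    """Greedily pick k queries, trading off each query's relevance rel[i]
    against its redundancy with already-chosen queries (sim[i, j] in [0, 1])."""
    chosen: list[int] = []
    candidates = set(range(len(rel)))
    while candidates and len(chosen) < k:
        def score(i: int) -> float:
            redundancy = max(sim[i, j] for j in chosen) if chosen else 0.0
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Toy usage: query 1 is nearly a duplicate of query 0, so it is skipped.
rel = np.array([0.9, 0.85, 0.4])
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.2],
                [0.1,  0.2, 1.0]])
print(select_queries(rel, sim, k=2))  # -> [0, 2]
```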
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked using fine-grained video representations (a sketch of this recipe follows this entry).
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
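A minimal sketch of the two-stage recipe described above: coarse embeddings recall a top-k shortlist cheaply, and only those k candidates pay for fine-grained scoring. The `fine_score` callable stands in for the paper's text-gated interaction block (TIB); everything here is illustrative rather than the authors' code.

```python
import numpy as np

def two_stage_retrieve(text_emb, coarse_video_embs, fine_score, k=10):
    """Stage 1: dot-product recall of a top-k shortlist with coarse embeddings.
    Stage 2: rerank only those k candidates with the expensive fine scorer."""
    coarse = coarse_video_embs @ text_emb                   # (num_videos,)
    shortlist = np.argsort(-coarse)[:k]                     # cheap recall
    return sorted(shortlist, key=lambda v: -fine_score(v))  # k expensive calls only

# Toy usage with a dummy fine scorer standing in for the TIB.
rng = np.random.default_rng(0)
videos = rng.normal(size=(100, 8))
query = rng.normal(size=8)
print(two_stage_retrieve(query, videos, fine_score=lambda v: float(videos[v] @ query), k=5))
```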
- Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning [10.486585276898472]
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks.
We postulate that understanding the significance of sentence components with respect to the target task can enhance model performance.
We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks.
arXiv Detail & Related papers (2023-12-10T02:03:51Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video-side modalities and show that we can effectively reduce the number of modalities used at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
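Generalized distillation of the kind the TEACHTEXT entry above describes can be sketched as supervising the student's text-video similarity matrix with an aggregate of the teachers' matrices. The mean aggregation and MSE loss below are assumptions for illustration; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_sim: torch.Tensor,
                      teacher_sims: list[torch.Tensor]) -> torch.Tensor:
    """MSE between the student's text-video similarity matrix (batch x batch)
    and the mean of the teachers' matrices."""
    target = torch.stack(teacher_sims).mean(dim=0)
    return F.mse_loss(student_sim, target)

# Toy usage: a batch of 4 texts vs 4 videos, two frozen text teachers.
student = torch.randn(4, 4, requires_grad=True)
teachers = [torch.randn(4, 4), torch.randn(4, 4)]
loss = distillation_loss(student, teachers)
loss.backward()  # gradients flow only into the student similarities
print(loss.item())
```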
- CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
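Adaptive curation in the spirit of the CUPID entry above can be approximated by scoring source pre-training examples against a target-domain embedding centroid and keeping the closest fraction. This is a generic stand-in, not CUPID's actual selection criterion.

```python
import numpy as np

def curate(source_embs: np.ndarray, target_embs: np.ndarray, keep_ratio: float = 0.3):
    """Rank source examples by cosine similarity to the mean target embedding
    and keep the top fraction as the curated pre-training subset."""
    centroid = target_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    src = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    sims = src @ centroid
    n_keep = max(1, int(keep_ratio * len(source_embs)))
    return np.argsort(-sims)[:n_keep]

# Toy usage: 50 source clips, 10 target-domain clips, keep the closest 30%.
rng = np.random.default_rng(1)
print(curate(rng.normal(size=(50, 16)), rng.normal(size=(10, 16))))
```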
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.