GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
- URL: http://arxiv.org/abs/2601.00584v1
- Date: Fri, 02 Jan 2026 06:04:58 GMT
- Title: GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
- Authors: Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim,
- Abstract summary: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. We propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations.
- Score: 12.668753075288308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
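The pairing step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how queries are rewritten, how captions are generated, or how pairs are scored. The sketch assumes the rewritten queries and both caption sets already exist as strings, uses a bag-of-words cosine as a stand-in for a real text similarity model, and averages over all pairs as one possible fusion rule. All function names and example strings are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Toy similarity (assumption): cosine between bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_segments(queries, agnostic_caps, aware_caps):
    """Score each segment by averaging similarity over every
    (query, caption) pair, where the captions are the segment's
    query-agnostic and query-aware captions."""
    scores = []
    for agnostic, aware in zip(agnostic_caps, aware_caps):
        pairs = [(q, c) for q in queries for c in (agnostic, aware)]
        scores.append(sum(bow_cosine(q, c) for q, c in pairs) / len(pairs))
    return scores


def best_segment(queries, agnostic_caps, aware_caps):
    """Return the index of the highest-scoring segment."""
    scores = score_segments(queries, agnostic_caps, aware_caps)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical inputs: one coarse and one fine rewrite of the same query,
# plus two captions per video segment.
queries = [
    "a man cooks",
    "a man chops onions and fries them in a pan",
]
agnostic_caps = [
    "a dog runs in a park",
    "a man stands in a kitchen",
    "people sit at a table",
]
aware_caps = [
    "a dog runs and nobody cooks",
    "a man chops onions at a counter",
    "people eat dinner at a table",
]

best = best_segment(queries, agnostic_caps, aware_caps)
```

In this toy setup the coarse query matches the query-agnostic kitchen caption, while the fine query matches the query-aware caption that mentions chopping onions, so both granularities contribute to the same segment's score.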
Related papers
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models [68.49490036960559]
We propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms. Our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.
arXiv Detail & Related papers (2026-02-24T18:20:57Z) - HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval [39.457158192955106]
We propose a novel Composed Video Retrieval (CVR) framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. Our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks.
arXiv Detail & Related papers (2025-12-02T14:10:16Z) - Temporal Grounding as a Learning Signal for Referring Video Object Segmentation [29.646697516547558]
Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. Existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training. We introduce MeViS-M, a dataset built upon the challenging MeViS benchmark, where we manually annotate temporal spans when each object is referred to by the expression.
arXiv Detail & Related papers (2025-08-16T07:34:43Z) - Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval [31.42856682276394]
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query.
Existing strategies are often sub-optimal since they ignore the modality imbalance problem.
We introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment.
arXiv Detail & Related papers (2023-12-19T13:38:48Z) - RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z) - Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Video Referring Expression Comprehension via Transformer with Content-aware Query [60.89442448993627]
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across the frame and the aligned region features are employed to provide fruitful clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z) - Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z) - Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.