Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models
- URL: http://arxiv.org/abs/2309.00661v1
- Date: Fri, 1 Sep 2023 13:06:50 GMT
- Title: Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models
- Authors: Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu
- Abstract summary: We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
- Score: 58.17315970207874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate video moment retrieval (VMR) requires universal visual-textual
correlations that can handle unknown vocabulary and unseen scenes. However, the
learned correlations are likely either biased when derived from a limited
amount of moment-text data which is hard to scale up because of the prohibitive
annotation cost (fully-supervised), or unreliable when only the video-text
pairwise relationships are available without fine-grained temporal annotations
(weakly-supervised). Recently, vision-language models (VLMs) have demonstrated
a new transfer-learning paradigm that benefits a variety of vision tasks
through universal visual-textual correlations derived from large-scale
vision-language web data, and fine-tuning such models in the target domain has
also proved beneficial for VMR. In this work, we propose a zero-shot method
for adapting the generalisable visual-textual priors of an arbitrary VLM to
facilitate moment-text alignment, without the need to access any VMR training
data. To this
end, we devise a conditional feature refinement module to generate
boundary-aware visual features conditioned on text queries to enable better
moment boundary understanding. Additionally, we design a bottom-up proposal
generation strategy that mitigates the impact of domain discrepancies and
breaks down complex-query retrieval tasks into individual action retrievals,
thereby maximizing the benefits of the VLM. Extensive experiments conducted on
three VMR benchmark datasets demonstrate the notable performance advantages of
our zero-shot algorithm, especially in the novel-word and novel-location
out-of-distribution setups.
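To make the zero-shot setting concrete, the following is a minimal sketch of how a frozen VLM's embeddings can rank candidate moments: sliding-window proposals are scored by their mean frame-to-query cosine similarity. This is a simplified baseline rather than the paper's conditional feature refinement or bottom-up proposal generation, and all names, dimensions, and window parameters are illustrative assumptions (random tensors stand in for frozen CLIP-style features).

```python
import torch
import torch.nn.functional as F

def score_proposals(frame_feats, text_feat, min_len=4, max_len=32, stride=4):
    """Rank sliding-window moment proposals by mean frame-query similarity."""
    frame_feats = F.normalize(frame_feats, dim=-1)   # (T, D) per-frame features
    text_feat = F.normalize(text_feat, dim=-1)       # (D,) query feature
    sims = frame_feats @ text_feat                   # (T,) cosine similarities
    num_frames = sims.size(0)
    proposals = []
    for length in range(min_len, max_len + 1, stride):
        for start in range(0, num_frames - length + 1, stride):
            score = sims[start:start + length].mean().item()
            proposals.append((start, start + length, score))
    return sorted(proposals, key=lambda p: -p[2])    # best-scoring first

# Stand-ins for features extracted once by a frozen VLM (e.g. CLIP, D=512).
frames = torch.randn(128, 512)   # one embedding per sampled frame
query = torch.randn(512)         # embedding of the text query
start, end, score = score_proposals(frames, query)[0]
print(f"best moment: frames [{start}, {end}), score={score:.3f}")
```

The paper's contribution lies in going beyond this baseline: refining the visual features conditioned on the query to sharpen moment boundaries, and composing proposals bottom-up from individual action retrievals.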
Related papers
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
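As a hedged toy sketch of the query-expansion idea in the GQE entry above (our reading, not the published implementation): a video is scored against the original query plus several expanded variants, and the similarities are aggregated so that a single phrasing mismatch does not sink retrieval. Shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def expanded_query_score(video_feat, query_feats, reduce="mean"):
    """Aggregate cosine similarity between one video and N query variants."""
    video_feat = F.normalize(video_feat, dim=-1)     # (D,) pooled video feature
    query_feats = F.normalize(query_feats, dim=-1)   # (N, D) query + expansions
    sims = query_feats @ video_feat                  # (N,) one score per variant
    return sims.mean() if reduce == "mean" else sims.max()

video = torch.randn(512)       # stand-in for a pooled video embedding
queries = torch.randn(4, 512)  # original query plus three expanded variants
print(f"aggregated score: {expanded_query_score(video, queries).item():.3f}")
```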
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
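Below is a hedged sketch of the narration idea in the entry above (our illustration, not the paper's code): sparse human annotations are augmented with MLLM-generated clip descriptions so that more of the visual content has paired text. `mllm_narrate` is a hypothetical stand-in for any multimodal-LLM captioning call.

```python
from typing import Dict, List

def mllm_narrate(clip_id: str) -> str:
    """Hypothetical placeholder: in practice, prompt a multimodal LLM with
    sampled frames of the clip and return its generated description."""
    return f"a generated description of {clip_id}"

def augment_annotations(samples: List[Dict]) -> List[Dict]:
    """Keep each (clip, human query) pair and add a narrated pseudo-query."""
    augmented = []
    for sample in samples:
        augmented.append(sample)  # original human annotation
        augmented.append({"clip": sample["clip"],
                          "query": mllm_narrate(sample["clip"]),
                          "source": "mllm"})
    return augmented

data = [{"clip": "video_001", "query": "a man opens the door", "source": "human"}]
print(augment_annotations(data))
```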
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video that matches a given text query description (sentence).
We introduce a new approach, hybrid-learning video moment retrieval, that addresses the problem through knowledge transfer between a fully-labelled source domain and a weakly-labelled target domain.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Context-Enhanced Video Moment Retrieval with Large Language Models [22.283367604425916]
Current methods for Video Moment Retrieval (VMR) struggle to align moments with queries involving specific environmental details, character descriptions, and action narratives.
We propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation.
Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks.
arXiv Detail & Related papers (2024-05-21T07:12:27Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them with the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that DoCo serves as a plug-and-play pre-training method for various LVLMs without increasing computational complexity at inference.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
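As we read the DoCo entry above, the alignment between document-object features and the vision encoder's features is a contrastive objective; the sketch below shows a symmetric InfoNCE loss over matched feature pairs. The loss form, shapes, and temperature are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(obj_feats, vis_feats, temperature=0.07):
    """Symmetric InfoNCE: the i-th object feature should match the i-th
    visual feature and repel all others in the batch."""
    obj = F.normalize(obj_feats, dim=-1)    # (B, D) document-object features
    vis = F.normalize(vis_feats, dim=-1)    # (B, D) vision-encoder features
    logits = obj @ vis.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(obj.size(0))     # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(f"alignment loss: {loss.item():.3f}")
```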
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic Network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z)