Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
- URL: http://arxiv.org/abs/2303.00040v2
- Date: Fri, 24 Mar 2023 17:35:07 GMT
- Title: Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
- Authors: Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu
- Abstract summary: The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
- Score: 70.83385449872495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods rely heavily on separate pre-training feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods focus on building temporal-aware video features, awareness of the text descriptions of temporal changes is also critical, yet it is overlooked when pre-training matches static images with sentences. Therefore, we extract visual context and spatial dynamic information from video frames and explicitly enforce their alignment with the phrases describing video changes (e.g. verbs). By doing so, the potentially relevant visual and motion patterns in videos are encoded (injected) into the corresponding text embeddings so as to enable more accurate video-text alignment. We conduct extensive experiments on two VMR benchmark datasets (Charades-STA and ActivityNet-Captions) and achieve state-of-the-art performance. In particular, VDI yields notable advantages when tested on the out-of-distribution splits where the testing samples involve novel scenes and vocabulary.
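To make the injection idea more concrete, below is a minimal sketch of one way such a visual-dynamic alignment objective could be written: pooled frame-difference features stand in for the spatial dynamics, verb-phrase embeddings stand in for the change-describing text, and an InfoNCE-style contrastive loss ties the two together. The tensor shapes, the frame-difference proxy, and every name in the snippet are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a contrastive alignment between pooled frame-difference
# ("dynamic") features and verb-phrase embeddings, in the spirit of the abstract.
import torch
import torch.nn.functional as F


def visual_dynamic_alignment_loss(frame_feats, verb_feats, temperature=0.07):
    """Align per-video dynamic features with verb-phrase embeddings.

    frame_feats: (B, T, D) per-frame features from an image-text model's visual encoder.
    verb_feats:  (B, D)    pooled embeddings of the verb phrases in each query sentence.
    """
    # Approximate spatial dynamics by consecutive frame differences, then pool over time.
    dynamics = frame_feats[:, 1:] - frame_feats[:, :-1]        # (B, T-1, D)
    video_dyn = F.normalize(dynamics.mean(dim=1), dim=-1)      # (B, D)
    verbs = F.normalize(verb_feats, dim=-1)                    # (B, D)

    # InfoNCE-style objective: each video's dynamics should match its own verb phrase.
    logits = video_dyn @ verbs.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In a full pipeline the verb-phrase embeddings would presumably come from the same text encoder as the sentence query (e.g. by pooling token embeddings at verb positions), so that the alignment gradient flows back into the text representations being injected.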
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence).
We introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
(A toy sketch of scoring frames against a query with a frozen VLM appears after this list.)
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network to transfer an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
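The zero-shot entry above leaves the moment-text alignment step abstract. As a toy illustration (not that paper's algorithm) of how a frozen image-text model can localise a moment without any VMR training, the sketch below assumes per-frame embeddings and a query embedding from some frozen VLM, scores each sampled frame against the query, and returns the contiguous window with the highest mean cosine similarity; the function, its parameters, and the exhaustive window search are all assumptions for exposition.

```python
# Toy sketch (not the cited paper's method): zero-shot moment proposal with a frozen VLM.
import numpy as np


def propose_moment(frame_embs, text_emb, min_len=2, max_len=None):
    """Return (start, end) frame indices of the window with the highest mean
    cosine similarity between frame embeddings and the query embedding.

    frame_embs: (T, D) L2-normalised frame embeddings from a frozen VLM.
    text_emb:   (D,)   L2-normalised query-sentence embedding from the same VLM.
    """
    sims = frame_embs @ text_emb                        # (T,) per-frame similarities
    T = len(sims)
    max_len = max_len or T
    prefix = np.concatenate(([0.0], np.cumsum(sims)))   # prefix sums for O(1) window means

    best, best_span = -np.inf, (0, min(min_len, T) - 1)
    for start in range(T):
        for end in range(start + min_len - 1, min(T, start + max_len)):
            mean_sim = (prefix[end + 1] - prefix[start]) / (end - start + 1)
            if mean_sim > best:
                best, best_span = mean_sim, (start, end)
    return best_span  # map frame indices back to timestamps via the sampling rate
```

The double loop is O(T^2) over sampled frames, which is acceptable for the few dozen frames typically sampled per video.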