LocVTP: Video-Text Pre-training for Temporal Localization
- URL: http://arxiv.org/abs/2207.10362v1
- Date: Thu, 21 Jul 2022 08:43:51 GMT
- Title: LocVTP: Video-Text Pre-training for Temporal Localization
- Authors: Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang, and Yuexian
Zou
- Abstract summary: Video-Text Pre-training aims to learn transferable representations for various downstream tasks from large-scale web videos.
In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks.
We propose a novel localization-oriented Video-Text Pre-training framework, dubbed LocVTP.
- Score: 71.74284893790092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-Text Pre-training (VTP) aims to learn transferable representations for
various downstream tasks from large-scale web videos. To date, almost all
existing VTP methods are limited to retrieval-based downstream tasks, e.g.,
video retrieval, whereas their transfer potential to localization-based tasks,
e.g., temporal grounding, is under-explored. In this paper, we experimentally
analyze and demonstrate the incompatibility of current VTP methods with
localization tasks, and propose a novel Localization-oriented Video-Text
Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained
contrastive alignment as a complement to the coarse-grained one via a
clip-word correspondence discovery scheme. To further enhance the temporal
reasoning ability of the learned features, we propose a context projection head
and a temporal-aware contrastive loss to perceive contextual relationships.
Extensive experiments on four downstream tasks across six datasets demonstrate
that our LocVTP achieves state-of-the-art performance on both retrieval-based
and localization-based tasks. Furthermore, we conduct comprehensive ablation
studies and thorough analyses to explore optimal model designs and training
strategies.
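The two alignment levels described in the abstract can be pictured with a short contrastive-loss sketch. The code below is a minimal illustration only, assuming InfoNCE-style objectives, a fixed temperature, and a simple argmax clip-word matching step; it is not the paper's exact formulation, and the context projection head and temporal-aware loss are omitted.

```python
# Illustrative PyTorch sketch of coarse (video-sentence) and fine (clip-word)
# contrastive alignment. Shapes, temperature, and the argmax matching are
# assumptions for illustration, not LocVTP's actual implementation.
import torch
import torch.nn.functional as F

def coarse_loss(video_emb, text_emb, tau=0.07):
    """Video-level vs. sentence-level InfoNCE over a batch of (B, D) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau                        # (B, B) cosine similarities
    labels = torch.arange(v.size(0), device=v.device)
    # symmetric loss: video-to-text and text-to-video directions
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def fine_loss(clip_emb, word_emb, tau=0.07):
    """Clip-word alignment: each clip treats its most similar word as the
    positive (a simple stand-in for a correspondence discovery scheme)."""
    c = F.normalize(clip_emb, dim=-1)               # (B, Nc, D) clips per video
    w = F.normalize(word_emb, dim=-1)               # (B, Nw, D) words per caption
    sim = torch.einsum('bnd,bmd->bnm', c, w) / tau  # (B, Nc, Nw)
    pos = sim.max(dim=-1).values                    # best-matching word per clip
    return (torch.logsumexp(sim, dim=-1) - pos).mean()

# Toy usage with random features.
B, Nc, Nw, D = 4, 8, 12, 256
loss = (coarse_loss(torch.randn(B, D), torch.randn(B, D))
        + fine_loss(torch.randn(B, Nc, D), torch.randn(B, Nw, D)))
```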
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model [13.983810804606264]
We propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks.
InCPL associates a new test sample with very few labeled examples as context information.
We introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples.
arXiv Detail & Related papers (2024-03-10T08:15:51Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search [17.360982091304137]
Text-based Person Search (TPS) aims to retrieve pedestrians that match text descriptions rather than query images.
Recent Vision-Language Pre-training models can bring transferable knowledge to downstream TPS tasks, enabling more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
arXiv Detail & Related papers (2023-03-08T10:41:22Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features (a rough sketch follows this list).
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
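The last entry above describes encoding temporal information with lightweight Transformers stacked on frozen frame-wise features. A rough sketch of that kind of design, with illustrative layer sizes and mean pooling as assumptions rather than the paper's settings, could look like this:

```python
# Rough sketch of a lightweight temporal head over frozen per-frame features;
# dimensions, depth, and mean pooling are illustrative assumptions only.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, dim=512, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_feats):          # (B, T, D) features from a frozen backbone
        ctx = self.encoder(frame_feats)      # model temporal context across frames
        return ctx.mean(dim=1)               # (B, D) pooled video-level embedding

video_emb = TemporalHead()(torch.randn(2, 16, 512))   # toy usage: 2 videos, 16 frames
```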