Prompting Visual-Language Models for Efficient Video Understanding
- URL: http://arxiv.org/abs/2112.04478v1
- Date: Wed, 8 Dec 2021 18:58:16 GMT
- Title: Prompting Visual-Language Models for Efficient Video Understanding
- Authors: Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie
- Abstract summary: This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
- Score: 28.754997650215486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-language pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, demonstrating
remarkable ability for zero-shot generalisation. This paper presents a simple
method to efficiently adapt one pre-trained visual-language model to novel
tasks with minimal training, and here, we consider video understanding tasks.
Specifically, we propose to optimise a few random vectors, termed continuous
prompt vectors, that convert the novel tasks into the same format as the
pre-training objectives. In addition, to bridge the gap between static images
and videos, temporal information is encoded with lightweight Transformers
stacked on top of frame-wise visual features. Experimentally, we conduct
extensive ablation studies to analyse the critical components and their
necessity. On 9 public benchmarks of action recognition, action localisation,
and text-video retrieval, across closed-set, few-shot, and open-set scenarios,
we achieve performance competitive with or superior to existing methods,
despite training significantly fewer parameters.
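As a rough illustration of the two ingredients the abstract describes, the sketch below assumes a frozen CLIP-style backbone: a handful of learnable prompt vectors are prepended to class-name token embeddings, and a lightweight temporal Transformer is stacked on frame-wise image features. Module names, sizes, and the `encode_text_from_embeddings` hook are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class PromptedVideoClassifier(nn.Module):
    def __init__(self, clip_model, num_prompts=16, dim=512):
        super().__init__()
        self.clip = clip_model                      # assumed frozen image/text encoders
        for p in self.clip.parameters():
            p.requires_grad = False
        # continuous prompt vectors: a few random vectors optimised end-to-end
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # lightweight temporal Transformer over frame-wise features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def encode_video(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.clip.encode_image(frames.flatten(0, 1))  # (B*T, dim) frame features
        feats = self.temporal(feats.view(b, t, -1))            # temporal reasoning
        return feats.mean(dim=1)                                # (B, dim) video embedding

    def encode_classes(self, class_token_embs):     # (C, L, dim) token embeddings per class name
        c = class_token_embs.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(c, -1, -1)
        # prepend learnable prompts to each class-name embedding sequence, then
        # run the text branch on embeddings (hypothetical hook, not a CLIP API)
        seq = torch.cat([prompts, class_token_embs], dim=1)
        return self.clip.encode_text_from_embeddings(seq)

    def forward(self, frames, class_token_embs):
        v = nn.functional.normalize(self.encode_video(frames), dim=-1)
        t = nn.functional.normalize(self.encode_classes(class_token_embs), dim=-1)
        return v @ t.t()                            # cosine-similarity logits
```

Under this reading, only the prompt vectors and the small temporal Transformer receive gradients, which is what keeps the number of trained parameters low.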
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain text descriptions and the visual-only focus of language-video tasks limit performance on real-world natural-language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
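The Hungarian-matching step mentioned above can be pictured with a small sketch: given embeddings of decomposed instruction phrases and of candidate video events, an optimal one-to-one assignment minimises a negative-similarity cost. The embeddings and similarity below are stand-ins, not the actual InsOVER procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_instructions_to_events(phrase_embs, event_embs):
    """phrase_embs: (P, d), event_embs: (E, d), both unit-normalised."""
    cost = -phrase_embs @ event_embs.T            # negative cosine similarity as cost
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))


# toy usage with random embeddings
rng = np.random.default_rng(0)
phrases = rng.normal(size=(3, 16)); phrases /= np.linalg.norm(phrases, axis=1, keepdims=True)
events = rng.normal(size=(5, 16)); events /= np.linalg.norm(events, axis=1, keepdims=True)
print(match_instructions_to_events(phrases, events))
```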
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Knowledge Prompting for Few-shot Action Recognition [20.973999078271483]
We propose a simple yet effective method, called knowledge prompting, to prompt a powerful vision-language model for few-shot classification.
We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base.
We feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame.
Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.001 of that of existing methods.
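A minimal version of the proposal-to-frame scoring step could look like the following, using the open-source CLIP package as a stand-in for the pre-trained vision-language model; the text proposals and frame file names are made up for illustration.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# hypothetical text proposals drawn from an action knowledge base
text_proposals = ["a person chopping vegetables",
                  "a person washing dishes",
                  "a person pouring water into a cup"]
tokens = clip.tokenize(text_proposals).to(device)

# hypothetical pre-extracted video frames
frames = torch.stack([preprocess(Image.open(f"frame_{i}.jpg")) for i in range(8)]).to(device)

with torch.no_grad():
    f = model.encode_image(frames)        # (T, dim) frame features
    t = model.encode_text(tokens)         # (K, dim) proposal features
    f = f / f.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)
    scores = f @ t.t()                    # (T, K) matching score of each proposal per frame
```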
arXiv Detail & Related papers (2022-11-22T06:05:17Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path, other than prompt tuning, to achieving better vision-language models.
In this paper, we propose CLIP-Adapter, which conducts fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
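The adapter idea can be summarised as a small residual bottleneck on top of frozen features; the sketch below follows that summary, with the reduction factor and residual ratio chosen arbitrarily rather than taken from the paper.

```python
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio
        # small bottleneck MLP trained on top of frozen features
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # residual blend: keep most of the frozen feature, add a learned correction
        return self.ratio * self.net(x) + (1 - self.ratio) * x


adapter = FeatureAdapter()
clip_feature = torch.randn(4, 512)        # e.g. frozen CLIP image features
adapted = adapter(clip_feature)
```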
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
- TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
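Read literally, the summary suggests a clip-level encoder supervised with both a local clip feature and a globally pooled video feature, with background (no-action) clips included in the training signal; the sketch below follows that reading and is an assumption about the setup, not a reproduction of TSP.

```python
import torch
import torch.nn as nn


class TemporallySensitiveHead(nn.Module):
    def __init__(self, clip_encoder, dim=512, num_actions=200):
        super().__init__()
        self.encoder = clip_encoder                              # any clip-level video encoder
        self.action_head = nn.Linear(2 * dim, num_actions + 1)   # +1 class for background clips
        self.region_head = nn.Linear(2 * dim, 2)                 # inside vs outside an action

    def forward(self, clip_frames, video_clip_feats):
        local = self.encoder(clip_frames)                        # (B, dim) feature of this clip
        global_ = video_clip_feats.max(dim=1).values             # (B, dim) pooled over all clips
        joint = torch.cat([local, global_], dim=-1)              # combine local and global context
        return self.action_head(joint), self.region_head(joint)
```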
arXiv Detail & Related papers (2020-11-23T15:40:15Z)