Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
- URL: http://arxiv.org/abs/2308.08414v1
- Date: Wed, 16 Aug 2023 15:00:50 GMT
- Title: Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
- Authors: Guangyi Chen, Xiao Liu, Guangrun Wang, Kun Zhang, Philip H.S. Torr,
Xiao-Ping Zhang, Yansong Tang
- Abstract summary: Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
- Score: 79.20605034378187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-language pre-trained models have shown remarkable success in guiding
video question-answering (VideoQA) tasks. However, due to the length of video
sequences, training large-scale video-based models incurs considerably higher
costs than training image-based ones. This motivates us to leverage the
knowledge from image-based pretraining, despite the obvious gaps between image
and video domains. To bridge these gaps, in this paper, we propose Tem-Adapter,
which enables the learning of temporal dynamics and complex semantics by a
visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional
pretrained knowledge adaptation methods that only concentrate on the downstream
task objective, the Temporal Aligner introduces an extra language-guided
autoregressive task aimed at facilitating the learning of temporal
dependencies, with the objective of predicting future states based on
historical clues and language guidance that describes event progression.
Besides, to reduce the semantic gap and adapt the textual representation for
better event description, we introduce a Semantic Aligner that first designs a
template to fuse question and answer pairs as event descriptions and then
learns a Transformer decoder with the whole video sequence as guidance for
refinement. We evaluate Tem-Adapter and different pre-training transfer
methods on two VideoQA benchmarks, and the significant performance improvement
demonstrates the effectiveness of our method.
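The abstract describes the two aligners only at a high level, so the following is a minimal sketch of how they could be realized on top of a frozen CLIP-style image-text backbone. The template in fuse_qa_as_event, the Transformer-decoder configurations, and the MSE next-feature objective are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of the two aligners described in the
# abstract, assuming frame-level visual features and token-level text features
# from a frozen image-text backbone such as CLIP.
import torch
import torch.nn as nn


def fuse_qa_as_event(question: str, answer: str) -> str:
    """Hypothetical template that fuses a question-answer pair into a
    declarative event description for the Semantic Aligner."""
    return f"{question.rstrip('?')} {answer}."


class SemanticAligner(nn.Module):
    """Transformer decoder that refines the fused QA text embedding,
    using the whole video sequence as cross-attention guidance."""

    def __init__(self, dim: int = 512, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, text_tokens: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, L, D) embeddings of the fused event description
        # video_feats: (B, T, D) frame features from the frozen image encoder
        return self.decoder(tgt=text_tokens, memory=video_feats)


class TemporalAligner(nn.Module):
    """Causal Transformer that predicts future frame features from historical
    frames under language guidance, giving the auxiliary autoregressive task."""

    def __init__(self, dim: int = 512, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, video_feats: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        B, T, D = video_feats.shape
        # Causal mask so position t only attends to frames <= t.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=video_feats.device), 1)
        hidden = self.decoder(tgt=video_feats, memory=text_tokens, tgt_mask=causal)
        return self.head(hidden)  # (B, T, D): predicted next-step features


def autoregressive_loss(pred: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
    # Predict frame t+1 from frames <= t; MSE here is an assumed stand-in objective.
    return nn.functional.mse_loss(pred[:, :-1], video_feats[:, 1:])


if __name__ == "__main__":
    B, T, L, D = 2, 8, 16, 512
    video = torch.randn(B, T, D)   # frame features from the frozen image encoder
    text = torch.randn(B, L, D)    # token embeddings of the fused event description
    sem, tem = SemanticAligner(D), TemporalAligner(D)
    refined_text = sem(text, video)           # (B, L, D)
    pred = tem(video, refined_text)           # (B, T, D)
    print(autoregressive_loss(pred, video))   # auxiliary objective
```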
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The plain, simple text descriptions and the visual-only focus of current language-video tasks limit capacity on real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal
Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z) - Learning Transferable Spatiotemporal Representations from Natural Script
Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video Transcript for ASR (TVTS), which sorts scripts by attending to learned video representations.
These advantages enable the model to contextualize what is happening much as humans do and to apply seamlessly to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z) - Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z) - CUPID: Adaptive Curation of Pre-training Data for Video-and-Language
Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.