Align and Prompt: Video-and-Language Pre-training with Entity Prompts
- URL: http://arxiv.org/abs/2112.09583v1
- Date: Fri, 17 Dec 2021 15:55:53 GMT
- Title: Align and Prompt: Video-and-Language Pre-training with Entity Prompts
- Authors: Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi
- Abstract summary: Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
- Score: 111.23364631136339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-and-language pre-training has shown promising improvements on various
downstream tasks. Most previous methods capture cross-modal interactions with a
transformer-based multimodal encoder, not fully addressing the misalignment
between unimodal video and text features. Besides, learning fine-grained
visual-language alignment usually requires off-the-shelf object detectors to
provide object information, which is bottlenecked by the detector's limited
vocabulary and expensive computation cost.
We propose Align and Prompt: an efficient and effective video-and-language
pre-training framework with better cross-modal alignment. First, we introduce a
video-text contrastive (VTC) loss to align unimodal video-text features at the
instance level, which eases the modeling of cross-modal interactions. Then, we
propose a new visually-grounded pre-training task, prompting entity modeling
(PEM), which aims to learn fine-grained region-entity alignment. To achieve
this, we first introduce an entity prompter module, which is trained with VTC
to produce the similarity between a video crop and text prompts instantiated
with entity names. The PEM task then asks the model to predict the entity
pseudo-labels (i.e., normalized similarity scores) for randomly-selected video
crops. The resulting pre-trained model achieves state-of-the-art performance on
both text-video retrieval and videoQA, outperforming prior work by a
substantial margin. Our code and pre-trained models will be released.
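As a concrete illustration of the two objectives above, the sketch below shows, in PyTorch, (i) an instance-level video-text contrastive (InfoNCE) loss and (ii) how entity pseudo-labels could be obtained by scoring video-crop embeddings against text prompts instantiated with entity names. This is a minimal sketch, not the authors' released code: the encoders, the prompt template, and the temperature values are assumptions for illustration.

```python
# Minimal sketch (not the released code) of the objectives described in the abstract.
# Assumes generic encoders that output L2-normalized embeddings; the prompt template
# and temperatures are illustrative choices.
import torch
import torch.nn.functional as F


def vtc_loss(video_emb, text_emb, temperature=0.07):
    """Instance-level video-text contrastive (InfoNCE) loss.

    video_emb, text_emb: [B, D] unit-normalized unimodal embeddings of paired
    videos and captions; matching pairs sit on the diagonal.
    """
    logits = video_emb @ text_emb.t() / temperature      # [B, B] similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)          # video -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)      # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


def entity_pseudo_labels(crop_emb, prompt_emb, temperature=0.07):
    """Soft targets from the entity prompter for randomly selected video crops.

    crop_emb:   [N, D] embeddings of video crops.
    prompt_emb: [K, D] embeddings of prompts such as "A video of a {entity}"
                instantiated with K entity names.
    Returns normalized similarity scores used as PEM pseudo-labels.
    """
    sims = crop_emb @ prompt_emb.t() / temperature       # [N, K]
    return sims.softmax(dim=-1)


def pem_loss(pred_logits, pseudo_labels):
    """PEM: the video-language model predicts the prompter's soft entity labels
    (cross-entropy against the soft distribution)."""
    return -(pseudo_labels * F.log_softmax(pred_logits, dim=-1)).sum(dim=-1).mean()
```

Because the pseudo-labels come from the prompter, which is itself trained with VTC, fine-grained region-entity supervision is obtained without an off-the-shelf object detector.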
Related papers
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)