UniVL: A Unified Video and Language Pre-Training Model for Multimodal
Understanding and Generation
- URL: http://arxiv.org/abs/2002.06353v3
- Date: Tue, 15 Sep 2020 13:27:13 GMT
- Title: UniVL: A Unified Video and Language Pre-Training Model for Multimodal
Understanding and Generation
- Authors: Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li,
Jason Li, Taroon Bharti, Ming Zhou
- Abstract summary: This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective.
- Score: 76.12027504427708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent success of the pre-training technique for NLP and
image-linguistic tasks, video-linguistic pre-training works have gradually been
developed to improve video-text related downstream tasks. However, most
existing multimodal models are pre-trained for understanding tasks, leading to
a pretrain-finetune discrepancy for generation tasks. This paper proposes
UniVL: a Unified Video and Language pre-training model for both multimodal
understanding and generation. It comprises four components: two single-modal
encoders, a cross encoder, and a decoder, all built on the Transformer
backbone. Five objectives, namely video-text joint, conditioned masked
language model (CMLM), conditioned masked frame model (CMFM), video-text
alignment, and language reconstruction, are designed to train each of the
components. We further develop two pre-training strategies, stage by stage
pre-training (StagedP) and enhanced video representation (EnhancedV), to make
the training process of UniVL more effective. Pre-training is carried out on
HowTo100M, a sizeable instructional video dataset. Experimental results
demonstrate that UniVL learns strong video-text representations and achieves
state-of-the-art results on five downstream tasks.
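
As a rough illustration of the architecture and objectives described in the abstract, the sketch below lays out the four components (two single-modal encoders, a cross encoder, a decoder) and an unweighted combination of the five losses. This is a minimal sketch under stated assumptions: the class name `UniVLSketch`, layer sizes, and the equal loss weighting are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical sketch of the four-component UniVL layout and its five objectives.
# Names, sizes, and loss weights are illustrative assumptions, not the released code.
import torch
import torch.nn as nn


class UniVLSketch(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_layers=6, vocab_size=30522):
        super().__init__()

        def enc():
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
                num_layers,
            )

        self.text_encoder = enc()    # single-modal encoder for text tokens
        self.video_encoder = enc()   # single-modal encoder for video features
        self.cross_encoder = enc()   # cross encoder fusing both modalities
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)  # language reconstruction head

    def forward(self, text_emb, video_feat, tgt_emb):
        t = self.text_encoder(text_emb)        # (B, Lt, D) contextual text features
        v = self.video_encoder(video_feat)     # (B, Lv, D) contextual video features
        fused = self.cross_encoder(torch.cat([t, v], dim=1))  # joint representation
        logits = self.lm_head(self.decoder(tgt_emb, fused))   # caption generation
        return t, v, fused, logits


def pretraining_loss(l_joint, l_cmlm, l_cmfm, l_align, l_recon):
    # Five objectives (video-text joint, CMLM, CMFM, video-text alignment,
    # language reconstruction) combined as an unweighted sum; weighting is a guess.
    return l_joint + l_cmlm + l_cmfm + l_align + l_recon
```

A real pre-training loop would additionally implement the masking behind CMLM and CMFM, the video-text alignment scoring, and the StagedP schedule, all of which are omitted here.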
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - VideoPoet: A Large Language Model for Zero-Shot Video Generation [78.57171527944774]
VideoPoet is a language model capable of synthesizing high-quality video with matching audio.
VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs.
arXiv Detail & Related papers (2023-12-21T18:46:41Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z) - Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)