Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training
- URL: http://arxiv.org/abs/2104.09411v1
- Date: Mon, 19 Apr 2021 15:58:45 GMT
- Title: Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training
- Authors: Chenyi Lei, Shixian Luo, Yong Liu, Wanggui He, Jiamang Wang, Guoxin
Wang, Haihong Tang, Chunyan Miao, Houqiang Li
- Abstract summary: We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
- Score: 79.88705563918413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained neural models have recently achieved impressive performance
in understanding multimodal content. However, pre-training neural models for video
and language understanding remains very challenging, especially for Chinese
video-language data, for the following reasons. First, existing video-language
pre-training algorithms mainly focus on the co-occurrence of words and video
frames, but ignore other valuable semantic and structural information in
video-language content, e.g., sequential order and spatiotemporal relationships.
Second, there exist conflicts between video-sentence alignment and other proxy
tasks. Third, there is a lack of large-scale, high-quality Chinese video-language
datasets (e.g., containing 10 million unique videos), which are fundamental to the
success of pre-training techniques.
In this work, we propose a novel video-language understanding framework named
VICTOR, which stands for VIdeo-language understanding via Contrastive
mulTimOdal pRe-training. Besides general proxy tasks such as masked language
modeling, VICTOR constructs several novel proxy tasks under the contrastive
learning paradigm, making the model more robust and able to capture more
complex multimodal semantic and structural relationships from different
perspectives. VICTOR is trained on a large-scale Chinese video-language
dataset containing over 10 million complete videos with corresponding
high-quality textual descriptions. We apply the pre-trained VICTOR model to a
series of downstream applications and demonstrate its superior performance
compared with state-of-the-art pre-training methods such as VideoBERT and
UniVL. The code and trained checkpoints will be publicly released to support
further development in the research community.
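The core idea behind contrastive proxy tasks of this kind is to pull matched video-text pairs together in a shared embedding space while pushing mismatched in-batch pairs apart. Below is a minimal PyTorch sketch of one such objective, a symmetric InfoNCE-style in-batch video-text contrastive loss; the function name, pooled-embedding assumption, and temperature value are illustrative choices, not the exact proxy tasks defined in the paper.

```python
# Minimal sketch of a symmetric in-batch video-text contrastive loss
# (InfoNCE-style). Assumes each video and its caption have already been
# pooled into single embeddings; VICTOR's actual proxy tasks (e.g., over
# sequential order and masked modalities) are richer than this.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                    # (batch, batch) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)         # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)     # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```

With a loss like this, every other caption in the batch serves as a negative for a given video, which is what makes large-scale paired data (such as the 10-million-video corpus above) so valuable for contrastive pre-training.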
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation utilizes both video frames and transcripts to generate the natural-language title of a video.
Previous research on pre-trained language models and video-language models has achieved significant progress on related downstream tasks.
We propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model.
arXiv Detail & Related papers (2022-11-14T08:59:59Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on a Transformer backbone (a rough layout is sketched after this list).
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
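For orientation, the sketch below illustrates the kind of four-component layout the UniVL entry describes: two single-modal encoders, a cross encoder, and a Transformer decoder. Layer counts, dimensions, and the use of PyTorch's generic nn.Transformer modules are assumptions made for illustration; this is not the released UniVL implementation.

```python
# Rough sketch of a UniVL-style layout: two single-modal encoders, a cross
# encoder, and a decoder, all Transformer-based. Sizes and module choices
# are illustrative assumptions, not the official UniVL code.
import torch
import torch.nn as nn

class UniVLStyleModel(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, vocab_size: int = 30522):
        super().__init__()
        enc_layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer(), num_layers=6)   # single-modal text encoder
        self.video_encoder = nn.TransformerEncoder(enc_layer(), num_layers=6)  # single-modal video encoder
        self.cross_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)  # fuses the two streams
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=3)
        self.lm_head = nn.Linear(dim, vocab_size)                              # generation head

    def forward(self, text_feats, video_feats, tgt_embeds):
        t = self.text_encoder(text_feats)                      # (B, Lt, dim)
        v = self.video_encoder(video_feats)                    # (B, Lv, dim)
        fused = self.cross_encoder(torch.cat([t, v], dim=1))   # joint video-text representation
        return self.lm_head(self.decoder(tgt_embeds, fused))   # logits for text generation
```

In a layout like this, understanding tasks can read off the cross encoder's fused output, while generation tasks additionally run the decoder.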
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.