VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending
- URL: http://arxiv.org/abs/2305.13167v1
- Date: Mon, 22 May 2023 15:54:22 GMT
- Title: VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending
- Authors: Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang
Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng
- Abstract summary: Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
- Score: 78.1399386935455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale image-text contrastive pre-training models, such as CLIP, have
been demonstrated to effectively learn high-quality multimodal representations.
However, there is limited research on learning video-text representations for
general video multimodal tasks based on these powerful features. Towards this
goal, we propose a novel video-text pre-training method dubbed VLAB: Video
Language pre-training by feature Adapting and Blending, which transfers CLIP
representations to video pre-training tasks and develops unified video
multimodal models for a wide range of video-text tasks. Specifically, VLAB is
founded on two key strategies: feature adapting and feature blending. In the
former, we introduce a new video adapter module to address CLIP's deficiency in
modeling temporal information and extend the model's capability to encompass
both contrastive and generative tasks. In the latter, we propose an end-to-end
training method that further enhances the model's performance by exploiting the
complementarity of image and video features. We validate the effectiveness and
versatility of VLAB through extensive experiments on highly competitive video
multimodal tasks, including video text retrieval, video captioning, and video
question answering. Remarkably, VLAB outperforms competing methods
significantly and sets new records in video question answering on MSRVTT, MSVD,
and TGIF datasets. It achieves an accuracy of 49.6, 61.0, and 79.0,
respectively. Codes and models will be released.
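The abstract's two strategies can be read as (1) adapting: adding a lightweight temporal module on top of per-frame CLIP features, and (2) blending: fusing frame-level (image) and temporally adapted (video) features. The following is a minimal, hypothetical PyTorch sketch of that reading; the module names, the residual temporal self-attention, and the learned scalar blending gate are illustrative assumptions, not the authors' released architecture.

```python
# Minimal sketch of the two ideas named in the VLAB abstract:
# (1) feature adapting: a temporal adapter over per-frame CLIP features,
# (2) feature blending: fusing frame-level (image) and temporally adapted (video) features.
# All names and hyperparameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Hypothetical adapter: self-attention across frames adds the temporal modeling CLIP lacks."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) per-frame CLIP embeddings
        attended, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return self.norm(frame_feats + attended)  # residual keeps the original CLIP signal


class BlendedVideoEncoder(nn.Module):
    """Hypothetical blending: weighted fusion of image-level and video-level features."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.adapter = TemporalAdapter(dim)
        self.gate = nn.Parameter(torch.tensor(0.5))  # learned blending weight (assumption)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        image_feat = frame_feats.mean(dim=1)                # frame-level (image) view
        video_feat = self.adapter(frame_feats).mean(dim=1)  # temporally adapted (video) view
        return self.gate * video_feat + (1 - self.gate) * image_feat


if __name__ == "__main__":
    clip_frame_feats = torch.randn(2, 8, 768)  # e.g. 8 frames of CLIP ViT features
    print(BlendedVideoEncoder()(clip_frame_feats).shape)  # torch.Size([2, 768])
```

In practice the fused video representation would feed both the contrastive (retrieval) and generative (captioning, question answering) heads described in the abstract.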
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - OmniVL:One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together; a generic sketch of this style of contrastive objective appears after this list.
arXiv Detail & Related papers (2022-09-15T17:59:59Z) - Advancing High-Resolution Video-Language Representation with Large-Scale
Video Transcriptions [31.4943447481144]
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks.
Our model achieves new state-of-the-art results in 10 understanding tasks and 2 more novel text-to-visual generation tasks.
arXiv Detail & Related papers (2021-11-19T17:36:01Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, forcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z) - UniVL: A Unified Video and Language Pre-Training Model for Multimodal
Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
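As noted in the OmniVL entry above, most of the listed methods, including CLIP-initialized models such as VLAB, train with a symmetric video-text contrastive objective alongside any generative losses. Below is a generic CLIP-style sketch of that objective in PyTorch; it is not OmniVL's UniVLC loss or VLAB's exact formulation, and the function name and temperature value are illustrative.

```python
# Generic CLIP-style symmetric contrastive loss over matched video-text pairs.
# Illustrative only; not the UniVLC loss or any listed paper's exact objective.
import torch
import torch.nn.functional as F


def symmetric_contrastive_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim); row i of each is a matched video-text pair."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)            # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)        # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    v, t = torch.randn(4, 768), torch.randn(4, 768)
    print(symmetric_contrastive_loss(v, t).item())
```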