All in One: Exploring Unified Video-Language Pre-training
- URL: http://arxiv.org/abs/2203.07303v1
- Date: Mon, 14 Mar 2022 17:06:30 GMT
- Title: All in One: Exploring Unified Video-Language Pre-training
- Authors: Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu
Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
- Abstract summary: We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
- Score: 44.22059872694995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mainstream video-language pre-training models (e.g., ActBERT, ClipBERT,
VIOLET) consist of three parts: a video encoder, a text encoder, and a video-text
fusion Transformer. They pursue better performance by using heavier unimodal
encoders or multimodal fusion Transformers, which increases the parameter count
and lowers efficiency on downstream tasks. In this work, we introduce, for the
first time, an end-to-end video-language model, the all-in-one Transformer, that
embeds raw video and textual signals into joint representations using a unified
backbone architecture. We argue that the unique temporal information of video
data is a key barrier hindering the design of a modality-agnostic Transformer.
To overcome this challenge, we introduce a novel and effective token rolling
operation that encodes temporal representations from video clips in a
non-parametric manner. This careful design enables representation learning for
both video-text multimodal inputs and unimodal inputs with a unified backbone
model. After fine-tuning, our pre-trained all-in-one Transformer transfers to
various downstream video-text tasks, including text-video retrieval, video
question answering, multiple choice, and visual commonsense reasoning.
State-of-the-art performance with minimal model FLOPs on nine datasets
demonstrates the superiority of our method over competitive counterparts. The
code and pretrained model have been released at
https://github.com/showlab/all-in-one.
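The abstract credits a non-parametric token rolling operation for injecting temporal information into a single shared backbone, but does not spell it out. The sketch below is only a rough illustration of what such an operation could look like in PyTorch, assuming visual tokens shaped (batch, frames, tokens, dim) and a hypothetical roll_ratio hyperparameter; the paper's actual rolling scheme may differ in which tokens are shifted and by how much.

```python
import torch

def temporal_token_roll(x: torch.Tensor, roll_ratio: float = 0.25) -> torch.Tensor:
    """Illustrative non-parametric temporal token rolling.

    x: visual tokens of shape (batch, frames, tokens, dim).
    A fraction of each frame's tokens is cyclically shifted along the time
    axis, so every frame mixes in tokens from a neighbouring frame without
    adding any learnable parameters.
    """
    b, t, n, d = x.shape
    k = max(1, int(n * roll_ratio))  # number of tokens to roll per frame (assumed hyperparameter)
    rolled = x.clone()
    # Shift the first k tokens of every frame forward by one time step:
    # frame i receives tokens that originally belonged to frame i-1.
    rolled[:, :, :k, :] = torch.roll(x[:, :, :k, :], shifts=1, dims=1)
    return rolled

# Example: a clip with 2 samples, 3 frames, 8 tokens of dimension 16
tokens = torch.randn(2, 3, 8, 16)
mixed = temporal_token_roll(tokens, roll_ratio=0.25)
print(mixed.shape)  # torch.Size([2, 3, 8, 16])
```

Because the operation is a pure tensor shift, it mixes information across adjacent frames at zero parameter cost, which is consistent with the abstract's claim of encoding temporal representations non-parametrically.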
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its video dynamics.
In this paper, we address this limitation in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) is a framework that exploits visual, text, and audio features and scores a video with respect to its frames.
The MFST framework first extracts the features of each modality (visual, text, audio) using pretrained encoders.
It then trains a multimodal frame-scoring transformer that takes the video-text-audio representations as input and predicts frame-level scores, as sketched below.
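As a hedged illustration of the pipeline summarized above, the sketch below fuses per-frame visual features with clip-level text and audio embeddings in a small Transformer encoder and predicts one score per frame. The module layout, dimensions, and the choice to prepend the text/audio embeddings as context tokens are assumptions for illustration, not MFST's released design.

```python
import torch
import torch.nn as nn

class FrameScoringTransformer(nn.Module):
    """Illustrative frame scorer: fuses per-frame visual features with
    clip-level text/audio embeddings and predicts one score per frame."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, frame_feats, text_feat, audio_feat):
        # frame_feats: (batch, frames, dim) from a pretrained visual encoder
        # text_feat, audio_feat: (batch, dim) from pretrained text/audio encoders
        context = torch.stack([text_feat, audio_feat], dim=1)   # (batch, 2, dim)
        tokens = torch.cat([context, frame_feats], dim=1)       # prepend context tokens
        fused = self.encoder(tokens)
        frame_scores = self.score_head(fused[:, 2:, :]).squeeze(-1)  # (batch, frames)
        return frame_scores

# Example with 16 frames and 512-dim features
scorer = FrameScoringTransformer()
scores = scorer(torch.randn(2, 16, 512), torch.randn(2, 512), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 16])
```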
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers [16.255516347736535]
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation.
CogVideo is trained by inheriting a pretrained text-to-image model, CogView2.
CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
arXiv Detail & Related papers (2022-05-29T19:02:15Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone (see the sketch after this entry).
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
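As a rough structural sketch of the four components named in the UniVL entry above (two single-modal encoders, a cross encoder, and a decoder on a Transformer backbone), the skeleton below wires them together for understanding and generation. The dimensions, layer counts, and language-model head are illustrative assumptions, not UniVL's actual implementation.

```python
import torch
import torch.nn as nn

class UnifiedVideoLanguageModel(nn.Module):
    """Skeleton of a four-part understanding+generation model:
    video encoder, text encoder, cross-modal encoder, and decoder."""

    def __init__(self, dim: int = 768, vocab_size: int = 30522):
        super().__init__()

        def encoder(num_layers: int) -> nn.TransformerEncoder:
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
                num_layers=num_layers,
            )

        self.video_encoder = encoder(2)   # single-modal video stream
        self.text_encoder = encoder(2)    # single-modal text stream
        self.cross_encoder = encoder(2)   # cross-modal fusion
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, video_feats, text_embeds, target_embeds):
        v = self.video_encoder(video_feats)                    # (batch, v_len, dim)
        t = self.text_encoder(text_embeds)                     # (batch, t_len, dim)
        joint = self.cross_encoder(torch.cat([v, t], dim=1))   # fused memory
        out = self.decoder(tgt=target_embeds, memory=joint)    # generation conditioned on fusion
        return self.lm_head(out)                               # token logits

model = UnifiedVideoLanguageModel()
logits = model(torch.randn(2, 20, 768), torch.randn(2, 12, 768), torch.randn(2, 8, 768))
print(logits.shape)  # torch.Size([2, 8, 30522])
```

Keeping the unimodal encoders separate from the cross encoder is what lets a single model serve retrieval-style understanding (from the unimodal and fused streams) and captioning-style generation (from the decoder).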
This list is automatically generated from the titles and abstracts of the papers on this site.