Clover: Towards A Unified Video-Language Alignment and Fusion Model
- URL: http://arxiv.org/abs/2207.07885v1
- Date: Sat, 16 Jul 2022 09:38:52 GMT
- Title: Clover: Towards A Unified Video-Language Alignment and Fusion Model
- Authors: Jingjia Huang, Yinan Li, Jiashi Feng, Xiaoshuai Sun and Rongrong Ji
- Abstract summary: We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
- Score: 154.1070559563592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building a universal video-language model for solving various video
understanding tasks (e.g., text-video retrieval, video question answering) is
an open challenge to the machine learning field. Towards this goal, most recent
attempts train the models, usually consisting of uni-modal and cross-modal
feature encoders, with supervised or pair-wise contrastive pre-text tasks.
Though offering attractive generality, the resulting models have to compromise
between efficiency and performance. We argue these flaws stem from their
pre-training strategies -- they cannot simultaneously align and fuse features
from different modalities well. We then introduce Clover -- a
Correlated Video-Language pre-training method -- towards a universal
video-language model for solving multiple video understanding tasks without
compromising either performance or efficiency. It improves cross-modal feature
alignment and fusion via a novel tri-modal alignment pre-training task.
Additionally, we propose to enhance the tri-modal alignment via incorporating
learning from masked samples and a novel pair-wise ranking loss. Clover
demonstrates outstanding generality. It establishes new state-of-the-art results
on multiple downstream tasks, including three retrieval tasks in both zero-shot
and fine-tuning settings, and eight video question answering tasks. Code and
pre-trained models will be released at https://github.com/LeeYN-43/Clover.
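For a concrete picture of the loss design described above, the snippet below is a minimal sketch, not the released Clover code: it combines a symmetric contrastive alignment between video and text embeddings with a margin-based pair-wise ranking loss that scores fused representations of complete video-text pairs above their masked counterparts. The encoder outputs, embedding dimension, temperature, and margin are all illustrative assumptions.

```python
# Minimal sketch (not the released Clover code): symmetric contrastive
# alignment between video and text embeddings plus a margin-based pair-wise
# ranking loss favoring complete pairs over masked ones. Shapes, temperature,
# and margin are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of embeddings with shape [B, D]."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature           # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def pairwise_ranking(sim_complete, sim_masked, margin=0.2):
    """Hinge loss pushing complete-pair similarity above the masked one."""
    return F.relu(margin - sim_complete + sim_masked).mean()


# Usage with random stand-ins for encoder outputs (B=8 clips, D=256 dims).
B, D = 8, 256
video_emb, text_emb = torch.randn(B, D), torch.randn(B, D)
fused_complete = torch.randn(B, D)   # fusion of full video + full text
fused_masked = torch.randn(B, D)     # fusion where one modality is masked

sim_c = F.cosine_similarity(fused_complete, video_emb, dim=-1)
sim_m = F.cosine_similarity(fused_masked, video_emb, dim=-1)
loss = contrastive_alignment(video_emb, text_emb) + pairwise_ranking(sim_c, sim_m)
print(float(loss))
```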
Related papers
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks; a minimal sketch of this interface idea appears after this list.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
arXiv Detail & Related papers (2022-06-14T20:43:25Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released at https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone.
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
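Since LAVENDER above casts every video-language task as masked language modeling, the sketch below illustrates that interface idea. It is an illustration under assumed shapes and hyperparameters, not the authors' implementation: video features and text tokens pass through one shared Transformer encoder, and both pre-training and downstream tasks reduce to predicting tokens at text positions over a single vocabulary head.

```python
# Minimal sketch of an MLM-as-interface model (illustrative, not LAVENDER's
# released code): one shared encoder over video and text tokens, one
# vocabulary head, and every task framed as token prediction at text positions.
import torch
import torch.nn as nn


class TinyVidLMLM(nn.Module):
    """Toy multimodal encoder with a single MLM-style output head."""

    def __init__(self, vocab_size=30522, dim=256, video_dim=512, n_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.video_proj = nn.Linear(video_dim, dim)        # project frame features
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(dim, vocab_size)         # shared vocabulary head

    def forward(self, video_feats, text_ids):
        # Concatenate projected video tokens with embedded text tokens.
        x = torch.cat([self.video_proj(video_feats),
                       self.text_embed(text_ids)], dim=1)
        x = self.encoder(x)
        # Score only the text positions against the vocabulary.
        return self.mlm_head(x[:, video_feats.size(1):])


# Usage: pre-training and, e.g., video QA both become "predict the masked token".
model = TinyVidLMLM()
video_feats = torch.randn(2, 8, 512)                       # 2 clips, 8 frame features
text_ids = torch.randint(0, 30522, (2, 16))                # 16 text tokens per clip
logits = model(video_feats, text_ids)                      # [2, 16, vocab_size]
# In real MLM training, only masked text positions would contribute to this loss.
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   text_ids.reshape(-1))
```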