OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
- URL: http://arxiv.org/abs/2209.07526v1
- Date: Thu, 15 Sep 2022 17:59:59 GMT
- Title: OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
- Authors: Junke Wang and Dongdong Chen and Zuxuan Wu and Chong Luo and Luowei
Zhou and Yucheng Zhao and Yujia Xie and Ce Liu and Yu-Gang Jiang and Lu Yuan
- Abstract summary: We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together.
- Score: 117.57580168859512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents OmniVL, a new foundation model to support both
image-language and video-language tasks using one universal architecture. It
adopts a unified transformer-based visual encoder for both image and video
inputs, and thus can perform joint image-language and video-language
pretraining. We demonstrate, for the first time, such a paradigm benefits both
image and video tasks, as opposed to the conventional one-directional transfer
(e.g., use image-language to help video-language). To this end, we propose a
decoupled joint pretraining of image-language and video-language to effectively
decompose the vision-language modeling into spatial and temporal dimensions and
obtain a performance boost on both image and video tasks. Moreover, we introduce
a novel unified vision-language contrastive (UniVLC) loss to leverage
image-text, video-text, image-label (e.g., image classification), and video-label
(e.g., video action recognition) data together, so that both supervised and
noisily supervised pretraining data are utilized as much as possible. Without
incurring extra task-specific adaptors, OmniVL can simultaneously support
visual-only tasks (e.g., image classification, video action recognition),
cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal
understanding and generation tasks (e.g., image/video question answering,
captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve
state-of-the-art or competitive results with similar model size and data scale.
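To make the UniVLC idea above more concrete, the snippet below is a minimal, hypothetical PyTorch sketch of a unified contrastive loss in which label-supervised samples (image classification, video action recognition) are treated as pseudo visual-text pairs and samples sharing a label count as extra positives, while web image/video-text pairs keep only their own caption as a positive. All tensor names, shapes, and the temperature value are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a UniVLC-style unified contrastive loss (assumed details,
# not the official OmniVL implementation).
import torch
import torch.nn.functional as F


def univlc_loss(visual_emb, text_emb, labels, temperature=0.07):
    """visual_emb, text_emb: (B, D) projected features from the shared encoder
    (an image is treated as a single-frame video, so both inputs go through the
    same visual branch). labels: (B,) integer ids; supervised samples share a
    class id, while each web image/video-text pair gets a unique id so only its
    own caption is a positive."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    logits = v @ t.t() / temperature                      # (B, B) similarities
    mask = (labels[:, None] == labels[None, :]).float()   # symmetric positive mask
    targets = mask / mask.sum(dim=1, keepdim=True)        # normalize multi-positives

    # Cross-entropy with soft targets in both retrieval directions.
    loss_v2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2v = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    B, D = 8, 256
    visual = torch.randn(B, D)
    text = torch.randn(B, D)
    # First four samples are label-supervised (two classes); the rest are
    # unique web image/video-text pairs.
    labels = torch.tensor([0, 0, 1, 1, 100, 101, 102, 103])
    print(univlc_loss(visual, text, labels).item())
```

In this reading, supervised and noisily supervised data share one objective: label data contributes extra positives per row, and pure web pairs reduce to the standard image/video-text contrastive loss.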
Related papers
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- Learning video embedding space with Natural Language Supervision [1.6822770693792823]
We propose a novel approach to map the video embedding space to natural language.
We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain.
arXiv Detail & Related papers (2023-03-25T23:24:57Z)
- MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge [35.45809761628721]
Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between the visual and text modalities.
We propose an unsupervised approach to finetuning on video data for the best zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
arXiv Detail & Related papers (2023-03-15T20:17:41Z)
- Aligning Source Visual and Target Language Domains for Unpaired Video Captioning [97.58101383280345]
Training a supervised video captioning model requires coupled video-caption pairs.
We introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language.
arXiv Detail & Related papers (2022-11-22T10:26:26Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, which is a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images.
arXiv Detail & Related papers (2022-06-02T16:14:19Z)