Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for
Vision-Language Tasks
- URL: http://arxiv.org/abs/2204.10496v1
- Date: Fri, 22 Apr 2022 04:41:04 GMT
- Title: Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for
Vision-Language Tasks
- Authors: Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin
Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-fu Chang, Lu Yuan
- Abstract summary: Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
- Score: 118.49566068398642
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Cross-modal encoders for vision-language (VL) tasks are often pretrained with
carefully curated vision-language datasets. While these datasets reach an order
of 10 million samples, the labor cost is prohibitive to scale further.
Conversely, unimodal encoders are pretrained with simpler annotations that are
less cost-prohibitive, achieving scales of hundreds of millions to billions. As
a result, unimodal encoders have achieved state-of-the-art (SOTA) performance on many
downstream tasks. However, challenges remain when applying them to VL tasks. The
pretraining data is not optimal for cross-modal architectures and requires
heavy computational resources. In addition, unimodal architectures lack
cross-modal interactions that have demonstrated significant benefits for VL
tasks. Therefore, how to best leverage pretrained unimodal encoders for VL
tasks is still an area of active research. In this work, we propose a method to
leverage unimodal vision and text encoders for VL tasks that augments existing
VL approaches while conserving computational complexity. Specifically, we
propose Multimodal Adaptive Distillation (MAD), which adaptively distills
useful knowledge from pretrained encoders to cross-modal VL encoders. In addition,
to better capture nuanced impacts on VL task performance, we introduce an
evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual
Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of
data constraints and conditions of domain shift. Experiments demonstrate that
MAD leads to consistent gains in the low-shot, domain-shifted, and
fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA
performance on VCR compared to other single models pretrained with image-text
data. Finally, MAD outperforms concurrent works that utilize the pretrained
vision encoder from CLIP. Code will be made available.
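As an illustration of the distillation idea described in the abstract, the following sketch (PyTorch-style Python) combines a downstream task loss with a feature-matching term against a frozen pretrained unimodal vision encoder. The per-sample adaptive weighting, the projection layer, and all module and argument names here are illustrative assumptions, not the actual MAD formulation.

# A minimal sketch of distilling visual knowledge from a frozen unimodal
# teacher into a trainable cross-modal student, in the spirit of the abstract
# above. The adaptive per-sample weighting below is a generic illustrative
# heuristic, NOT the gating rule used by MAD; "student", "teacher", and
# "task_head" are placeholder modules supplied by the caller.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledVLModel(nn.Module):
    def __init__(self, student, teacher, student_dim, teacher_dim):
        super().__init__()
        self.student = student            # trainable cross-modal VL encoder
        self.teacher = teacher.eval()     # frozen pretrained unimodal vision encoder
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(student_dim, teacher_dim)  # align feature spaces

    def forward(self, images, text_tokens, labels, task_head, alpha=0.5):
        # Downstream task loss on the student's fused image-text representation.
        fused = self.student(images, text_tokens)         # (B, student_dim)
        task_loss = F.cross_entropy(task_head(fused), labels)

        # Feature distillation: pull the projected student features toward the
        # frozen teacher's visual features (cosine distance).
        with torch.no_grad():
            teacher_feat = self.teacher(images)            # (B, teacher_dim)
        sim = F.cosine_similarity(self.proj(fused), teacher_feat, dim=-1)

        # Illustrative "adaptive" weighting: distill harder on samples where
        # the student is currently least aligned with the teacher.
        weight = (1.0 - sim).detach()
        distill_loss = (weight * (1.0 - sim)).mean()

        return task_loss + alpha * distill_loss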
Related papers
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for vision-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- ManagerTower: Aggregating the Insights of Uni-Modal Experts for
Vision-Language Representation Learning [73.47165576175541]
Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream tasks.
We propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels.
arXiv Detail & Related papers (2023-05-31T18:23:57Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge
Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- Scheduled Sampling in Vision-Language Pretraining with Decoupled
Encoder-Decoder Network [99.03895740754402]
We propose a two-stream decoupled encoder-decoder design, in which a decoupled cross-modal encoder and decoder are involved.
We further propose a primary scheduled sampling strategy that mitigates the discrepancy between pretraining and inference by pretraining the encoder-decoder in a two-pass manner.
arXiv Detail & Related papers (2021-01-27T17:36:57Z)