ManagerTower: Aggregating the Insights of Uni-Modal Experts for
Vision-Language Representation Learning
- URL: http://arxiv.org/abs/2306.00103v1
- Date: Wed, 31 May 2023 18:23:57 GMT
- Title: ManagerTower: Aggregating the Insights of Uni-Modal Experts for
Vision-Language Representation Learning
- Authors: Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla,
Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan
- Abstract summary: Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream tasks.
We propose ManagerTower, a novel VL model architecture that gathers and combines the insights of pre-trained uni-modal experts at different levels.
- Score: 73.47165576175541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two-Tower Vision-Language (VL) models have shown promising improvements on
various downstream VL tasks. Although the most advanced work improves
performance by building bridges between encoders, it suffers from ineffective
layer-by-layer utilization of uni-modal representations and cannot flexibly
exploit different levels of uni-modal semantic knowledge. In this work, we
propose ManagerTower, a novel VL model architecture that gathers and combines
the insights of pre-trained uni-modal experts at different levels. The managers
introduced in each cross-modal layer can adaptively aggregate uni-modal
semantic knowledge to facilitate more comprehensive cross-modal alignment and
fusion. ManagerTower outperforms previous strong baselines both with and
without Vision-Language Pre-training (VLP). With only 4M VLP data, ManagerTower
achieves superior performance on various downstream VL tasks, notably
79.15% accuracy on VQAv2 Test-Std, 86.56% IR@1 and 95.64% TR@1 on Flickr30K.
Code and checkpoints are available at https://github.com/LooperXX/ManagerTower.
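The core mechanism in the abstract is a "manager" in each cross-modal layer that adaptively aggregates the multi-level representations of pre-trained uni-modal experts. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, assuming a per-token softmax mixture over expert layers conditioned on the previous cross-modal hidden state; the paper's actual manager design may differ, and the linked repository contains the authors' implementation.

import torch
import torch.nn as nn

class Manager(nn.Module):
    # Hypothetical sketch: aggregate the outputs of several uni-modal encoder
    # layers (the "experts") into a single feature for one cross-modal layer,
    # with mixture weights conditioned on the current cross-modal hidden state.
    def __init__(self, hidden_size: int, num_expert_layers: int):
        super().__init__()
        self.weight_head = nn.Linear(hidden_size, num_expert_layers)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, expert_states: torch.Tensor, cross_state: torch.Tensor) -> torch.Tensor:
        # expert_states: (num_layers, batch, seq_len, hidden) stacked expert outputs
        # cross_state:   (batch, seq_len, hidden) previous cross-modal layer output
        scores = self.weight_head(cross_state)               # (batch, seq, num_layers)
        weights = scores.softmax(dim=-1)                     # adaptive per-token mixture
        weights = weights.permute(2, 0, 1).unsqueeze(-1)     # (num_layers, batch, seq, 1)
        aggregated = (weights * expert_states).sum(dim=0)    # weighted sum over expert layers
        return self.norm(aggregated)

# Example: mix 6 expert layers for a batch of 2 sequences of length 5, hidden size 8.
# manager = Manager(hidden_size=8, num_expert_layers=6)
# fused = manager(torch.randn(6, 2, 5, 8), torch.randn(2, 5, 8))  # -> (2, 5, 8)

Conditioning the mixture weights on the cross-modal state is what would make the aggregation adaptive rather than a fixed layer weighting, which is the contrast the abstract draws with layer-by-layer bridging.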
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning [91.93547262073213]
Vision-Language (VL) models with the Two-Tower architecture have dominated vision-language representation learning in recent years.
We propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder.
BridgeTower achieves an accuracy of 78.73% on the VQAv2 test-std set, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data.
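For contrast with ManagerTower's managers, the bridge idea summarized above can be read as a fixed, layer-by-layer fusion: each cross-modal layer receives one designated top uni-modal layer through a residual connection. The following is a hedged sketch of that reading, not BridgeTower's exact design.

import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    # Hypothetical sketch: fuse one top uni-modal encoder layer into the input
    # of its corresponding cross-modal layer via a projected residual connection.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, cross_input: torch.Tensor, unimodal_state: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, seq_len, hidden); one uni-modal layer feeds one cross-modal layer.
        return self.norm(cross_input + self.proj(unimodal_state))

This one-to-one wiring is the "layer-by-layer utilization" that the ManagerTower abstract argues is ineffective, since each cross-modal layer sees only a single level of uni-modal semantics.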
arXiv Detail & Related papers (2022-06-17T09:42:35Z)
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)