Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models
- URL: http://arxiv.org/abs/2005.07310v2
- Date: Sat, 18 Jul 2020 23:10:35 GMT
- Title: Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models
- Authors: Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen and Jingjing
Liu
- Abstract summary: Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
- Score: 65.19308052012858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Transformer-based large-scale pre-trained models have revolutionized
vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER
have significantly lifted state of the art across a wide range of V+L
benchmarks with joint image-text pre-training. However, little is known about
the inner mechanisms that underlie their impressive success. To reveal the
secrets behind the scene of these powerful models, we present VALUE
(Vision-And-Language Understanding Evaluation), a set of meticulously designed
probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection,
Linguistic Probing Tasks) generalizable to standard pre-trained V+L models,
aiming to decipher the inner workings of multimodal pre-training (e.g., the
implicit knowledge garnered in individual attention heads, the inherent
cross-modal alignment learned through contextualized multimodal embeddings).
Through extensive analysis of each archetypal model architecture via these
probing tasks, our key observations are: (i) Pre-trained models exhibit a
propensity for attending over text rather than images during inference. (ii)
There exists a subset of attention heads that are tailored for capturing
cross-modal interactions. (iii) The learned attention matrices in pre-trained models
demonstrate patterns coherent with the latent alignment between image regions
and textual words. (iv) Plotted attention patterns reveal
visually-interpretable relations among image regions. (v) Pure linguistic
knowledge is also effectively encoded in the attention heads. These are
valuable insights serving to guide future work towards designing better model
architecture and objectives for multimodal pre-training.
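To make observations (i)-(iii) concrete, the sketch below shows one way to measure how attention mass splits between text tokens and image regions in a single-stream V+L model. This is an illustrative example only, not the authors' VALUE probing code; the tensor layout (text tokens first, image regions after) and the helper name modality_attention_mass are assumptions.

```python
# A minimal sketch (not the VALUE toolkit) of measuring per-modality attention
# mass from a single-stream V+L model whose attention tensors have shape
# (num_heads, seq_len, seq_len), with the first `num_text` positions being
# word tokens and the remaining positions being image-region features.

import torch


def modality_attention_mass(attn: torch.Tensor, num_text: int):
    """Return the average attention each head sends to text vs. image keys.

    attn: (num_heads, seq_len, seq_len) attention probabilities (rows sum to 1).
    num_text: number of text tokens at the start of the sequence.
    """
    # Sum attention over text keys and image keys separately,
    # then average over all query positions.
    text_mass = attn[:, :, :num_text].sum(dim=-1).mean(dim=-1)   # (num_heads,)
    image_mass = attn[:, :, num_text:].sum(dim=-1).mean(dim=-1)  # (num_heads,)
    return text_mass, image_mass


if __name__ == "__main__":
    # Toy example: 12 heads, 20 text tokens + 36 image regions.
    num_heads, num_text, num_regions = 12, 20, 36
    seq_len = num_text + num_regions
    attn = torch.randn(num_heads, seq_len, seq_len).softmax(dim=-1)
    text_mass, image_mass = modality_attention_mass(attn, num_text)
    # Heads with unusually high image_mass are candidates for the
    # cross-modal heads described in observation (ii).
    print(text_mass.mean().item(), image_mass.mean().item())
```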
Related papers
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception
Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNN and ViT learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis [25.482853330324748]
Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years.
Previous approaches either (i) use separately pre-trained visual and textual models, which ignore the cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks.
We propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pre-training and downstream tasks.
arXiv Detail & Related papers (2022-04-17T08:44:00Z)
- A Survey of Vision-Language Pre-Trained Models [41.323956143107644]
Pre-trained models have advanced at a breakneck pace in recent years.
How to adapt pre-training to the field of Vision-and-Language learning and improve performance on downstream tasks has become a focus of multimodal learning.
arXiv Detail & Related papers (2022-02-18T15:15:46Z)
- ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method, MoCo, to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples within limited GPU resources (see the sketch after this list).
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
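The WenLan/BriVL entry above mentions a MoCo-style queue-based dictionary that supplies extra negatives for cross-modal contrastive learning without enlarging the batch. The sketch below illustrates that idea under stated assumptions; it is not the BriVL implementation, and the class name CrossModalQueue, the feature dimension, the queue size, and the temperature tau are all illustrative.

```python
# A minimal sketch of a MoCo-style queue of negative keys for image-to-text
# contrastive learning. Not the BriVL/WenLan code; names and sizes are assumed.

import torch
import torch.nn.functional as F


class CrossModalQueue:
    def __init__(self, dim: int = 256, queue_size: int = 4096):
        # Queue of L2-normalized text features acting as negative keys.
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys: torch.Tensor):
        # Replace the oldest entries with the newest (momentum-encoder) outputs.
        n = keys.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.size(0)
        self.queue[idx] = F.normalize(keys, dim=-1)
        self.ptr = (self.ptr + n) % self.queue.size(0)

    def contrastive_loss(self, img_feats, txt_feats, tau: float = 0.07):
        # Image-to-text InfoNCE: the positive is the matching caption;
        # negatives come from the queue, so the effective dictionary is
        # much larger than the batch.
        img = F.normalize(img_feats, dim=-1)
        txt = F.normalize(txt_feats, dim=-1)
        pos = (img * txt).sum(dim=-1, keepdim=True)           # (B, 1)
        neg = img @ self.queue.t()                            # (B, K)
        logits = torch.cat([pos, neg], dim=1) / tau
        labels = torch.zeros(img.size(0), dtype=torch.long)   # positive at index 0
        return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    q = CrossModalQueue()
    img_feats, txt_feats = torch.randn(8, 256), torch.randn(8, 256)
    loss = q.contrastive_loss(img_feats, txt_feats)
    q.enqueue(txt_feats)  # in a real setup, these come from a momentum encoder
    print(loss.item())
```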