A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
- URL: http://arxiv.org/abs/2012.08673v2
- Date: Tue, 30 Mar 2021 23:51:50 GMT
- Title: A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
- Authors: Linjie Li, Zhe Gan, Jingjing Liu
- Abstract summary: Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level.
Although these models achieve impressive performance on standard tasks, it remains unclear how robust they are.
We propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models.
- Score: 42.13369297087191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER,
have propelled the state of the art in vision-and-language (V+L) research to a
new level. Although these models achieve impressive performance on standard tasks, it remains unclear to date how robust they are. To investigate, we conduct a host of thorough evaluations on existing pre-trained models over four types of V+L-specific model robustness: (i) Linguistic
Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv)
Answer Distribution Shift. Interestingly, with standard model finetuning,
pre-trained V+L models already exhibit better robustness than many
task-specific state-of-the-art methods. To further enhance model robustness, we
propose Mango, a generic and efficient approach that learns a Multimodal
Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L
models. Unlike previous studies that focus on one specific type of robustness, Mango is task-agnostic and enables a universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of
robustness. Comprehensive experiments demonstrate that Mango achieves new state
of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by
a significant margin. As the first comprehensive study on V+L robustness, this
work puts the robustness of pre-trained models into sharper focus, pointing to new directions for future study.
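The abstract describes Mango only at a high level: a generator learns additive noise in the multimodal embedding space, and the pre-trained model is trained to withstand it. The PyTorch-style code below is a minimal sketch of that general min-max recipe, not the authors' implementation; `NoiseGenerator`, `vl_model`, and `task_loss` are hypothetical names, and it assumes image and text embeddings share the model's joint hidden dimension.

```python
# Minimal sketch of embedding-space adversarial training. Hypothetical names
# throughout; this is not the Mango reference implementation.
import torch
import torch.nn as nn

class NoiseGenerator(nn.Module):
    """Maps an embedding to a small, bounded additive perturbation."""
    def __init__(self, dim: int, eps: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eps = eps

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # tanh keeps the noise within [-eps, eps], so perturbed embeddings
        # stay close to the originals.
        return self.eps * torch.tanh(self.net(emb))

def adversarial_step(vl_model, generator, img_emb, txt_emb, labels, task_loss):
    """One min-max step: the generator maximizes the task loss, the model minimizes it."""
    # Generator objective: perturb the embeddings so the V+L model fails.
    noisy_img = img_emb + generator(img_emb)
    noisy_txt = txt_emb + generator(txt_emb)
    gen_loss = -task_loss(vl_model(noisy_img, noisy_txt), labels)

    # Model objective: stay accurate on both clean and perturbed embeddings.
    clean_loss = task_loss(vl_model(img_emb, txt_emb), labels)
    robust_loss = task_loss(vl_model(noisy_img.detach(), noisy_txt.detach()), labels)
    model_loss = clean_loss + robust_loss

    # In practice the two losses are stepped with separate optimizers.
    return gen_loss, model_loss
```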
Related papers
- Partially Recentralization Softmax Loss for Vision-Language Models Robustness [8.78222772167501]
We study the adversarial robustness provided by modifying the loss function of pre-trained multimodal models.
Our experiments show that after fine-tuning, the adversarial robustness of pre-trained models against popular attacks can be significantly improved.
arXiv Detail & Related papers (2024-02-06T01:44:38Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
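As a rough illustration of the recipe summarized above (freeze nearly all parameters, train a single linear projection, and prepend one trainable token), a hedged PyTorch-style sketch might look as follows; `PerceptualAdapter` and its forward signature are placeholders rather than the eP-ALM code.

```python
# Illustrative sketch only: freeze the backbone language model, train one
# linear projection from visual features into the LM embedding space, and
# prepend a single learnable token. Placeholder names, not the eP-ALM code.
import torch
import torch.nn as nn

class PerceptualAdapter(nn.Module):
    def __init__(self, language_model: nn.Module, vision_dim: int, lm_dim: int):
        super().__init__()
        self.lm = language_model
        for p in self.lm.parameters():
            p.requires_grad = False                         # freeze >99% of parameters
        self.proj = nn.Linear(vision_dim, lm_dim)           # the only trainable layer
        self.prompt = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor):
        # visual_feats: (B, N, vision_dim); text_embeds: (B, T, lm_dim)
        vis = self.proj(visual_feats)
        prompt = self.prompt.expand(text_embeds.size(0), -1, -1)
        inputs = torch.cat([prompt, vis, text_embeds], dim=1)
        # Assumes a language model that accepts pre-computed input embeddings
        # (e.g., the HuggingFace `inputs_embeds` argument).
        return self.lm(inputs_embeds=inputs)
```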
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have attracted increasing attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep learning and pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, networks, and knowledge-enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z)
- ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In the pre-training stage, we adopt a masked language modeling (MLM) task, a classification task, and a contrastive learning task.
In the fine-tuning stage, we use confident learning, the exponential moving average method (EMA), adversarial training (FGM), and the regularized dropout strategy (R-Drop).
arXiv Detail & Related papers (2023-01-31T07:31:34Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state of the art across reliability tasks and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.