Pretrained Transformers Improve Out-of-Distribution Robustness
- URL: http://arxiv.org/abs/2004.06100v2
- Date: Thu, 16 Apr 2020 05:01:33 GMT
- Title: Pretrained Transformers Improve Out-of-Distribution Robustness
- Authors: Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh
Krishnan, and Dawn Song
- Abstract summary: We measure out-of-distribution generalization for seven NLP datasets.
We show that pretrained Transformers' performance declines are substantially smaller.
We examine which factors affect robustness, finding that larger models are not necessarily more robust.
- Score: 72.38747394482247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pretrained Transformers such as BERT achieve high accuracy on
in-distribution examples, do they generalize to new distributions? We
systematically measure out-of-distribution (OOD) generalization for seven NLP
datasets by constructing a new robustness benchmark with realistic distribution
shifts. We measure the generalization of previous models including bag-of-words
models, ConvNets, and LSTMs, and we show that pretrained Transformers'
performance declines are substantially smaller. Pretrained transformers are
also more effective at detecting anomalous or OOD examples, while many previous
models are frequently worse than chance. We examine which factors affect
robustness, finding that larger models are not necessarily more robust,
distillation can be harmful, and more diverse pretraining data can enhance
robustness. Finally, we show where future work can improve OOD robustness.
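As a rough illustration of the two evaluations described in the abstract, the sketch below (a minimal sketch, not the paper's released code) computes the ID-to-OOD accuracy decline and an OOD-detection AUROC from a classifier's softmax outputs; scoring inputs by their maximum softmax probability is one common anomaly score and is assumed here purely for illustration.

```python
# Minimal sketch: measuring OOD generalization decline and OOD detection
# from softmax outputs. Assumes numpy arrays of class probabilities and
# integer labels; the max-softmax-probability score is an illustrative choice.
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_generalization_gap(id_probs, id_labels, ood_probs, ood_labels):
    """Accuracy decline when moving from in-distribution to shifted test data."""
    id_acc = float((id_probs.argmax(axis=1) == id_labels).mean())
    ood_acc = float((ood_probs.argmax(axis=1) == ood_labels).mean())
    return id_acc - ood_acc

def msp_ood_detection_auroc(id_probs, ood_probs):
    """AUROC for separating ID from OOD inputs, scoring each input by its
    maximum softmax probability (higher = more confidently in-distribution)."""
    scores = np.concatenate([id_probs.max(axis=1), ood_probs.max(axis=1)])
    is_id = np.concatenate([np.ones(len(id_probs)), np.zeros(len(ood_probs))])
    return roc_auc_score(is_id, scores)  # 0.5 = chance, 1.0 = perfect detection
```

Under this kind of evaluation, a detector that is "worse than chance" corresponds to an AUROC below 0.5, which is the failure mode the abstract attributes to many pre-Transformer models.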
Related papers
- Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders [56.47577824219207]
In this paper, we unveil the hidden costs associated with intrusive fine-tuning techniques.
We introduce a new model reprogramming approach for fine-tuning, which we name Reprogrammer.
Our empirical evidence reveals that Reprogrammer is less intrusive and yields superior downstream models.
arXiv Detail & Related papers (2024-03-16T04:19:48Z)
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models.
We show that OOD classification and OOD calibration errors share an upper bound consisting of two terms based on ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
- Combining pre-trained Vision Transformers and CIDER for Out Of Domain Detection [0.774971301405295]
Most industrial pipelines rely on pre-trained models, such as CNNs or Vision Transformers, for downstream tasks.
This paper investigates the performance of those models on the task of out-of-domain detection.
arXiv Detail & Related papers (2023-09-06T14:41:55Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, which to our knowledge is the first time this has been demonstrated.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal language models such as the Transformer for the downstream task of generating Failure Analysis Triplets (FATs).
We observe that Generative Pre-trained Transformer 2 (GPT2) outperforms other transformer models on the failure analysis triplet generation (FATG) task.
In particular, we observe that GPT2 (with 1.5B parameters) outperforms pre-trained BERT, BART, and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
- Pretrained Transformers Do not Always Improve Robustness [23.227505403565903]
We show that pretrained Transformers (PT) provide less robust representations than traditional models when exposed to noisy data.
We augment PT with an adversarial filtering (AF) mechanism that has been shown to improve OOD generalization.
However, an increase in generalization does not necessarily increase robustness, as we find that noisy data fools the AF method powered by PT.
arXiv Detail & Related papers (2022-10-14T09:30:36Z)
- Improving Out-of-Distribution Generalization by Adversarial Training with Structured Priors [17.936426699670864]
We show that sample-wise Adversarial Training (AT) yields only limited improvement in Out-of-Distribution (OOD) generalization.
We propose two AT variants with low-rank structures to train OOD-robust models.
Our proposed approaches outperform Empirical Risk Minimization (ERM) and sample-wise AT.
arXiv Detail & Related papers (2022-10-13T07:37:42Z)
- Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency correlates with better average OOD robustness for some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines (a minimal sketch of such a distillation objective appears after this list).
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers? [7.727662147015879]
Task-agnostic forms of data augmentation have proven widely effective in computer vision, even on pretrained models.
We ask how effective these techniques really are when applied to pretrained transformers.
We observe a negative result, finding that techniques which previously reported strong improvements for non-pretrained models fail to consistently improve performance for pretrained transformers.
arXiv Detail & Related papers (2020-10-05T03:55:15Z)
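Returning to the churn-reduction entry above: the following is a minimal sketch, in PyTorch, of distilling against a frozen base model as the teacher, which implicitly constrains predictive churn by pulling the new model toward the base model's predictions while it still fits the labels. The mixing weight and temperature are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch (not the paper's implementation): distillation from a frozen
# base model, which discourages predictions from churning away from the base.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, base_logits, labels, alpha=0.5, temperature=2.0):
    """Cross-entropy on the labels plus a KL term that pulls the student's
    soft predictions toward those of the frozen base model (the teacher)."""
    ce = F.cross_entropy(student_logits, labels)
    soft_teacher = F.softmax(base_logits.detach() / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kl

# Example: a batch of 4 examples with 3 classes.
loss = distillation_loss(torch.randn(4, 3), torch.randn(4, 3), torch.tensor([0, 2, 1, 0]))
```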