Pretrained Transformers Improve Out-of-Distribution Robustness
- URL: http://arxiv.org/abs/2004.06100v2
- Date: Thu, 16 Apr 2020 05:01:33 GMT
- Title: Pretrained Transformers Improve Out-of-Distribution Robustness
- Authors: Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh
Krishnan, and Dawn Song
- Abstract summary: We measure out-of-distribution generalization for seven NLP datasets.
We show that pretrained Transformers' performance declines are substantially smaller.
We examine which factors affect robustness, finding that larger models are not necessarily more robust.
- Score: 72.38747394482247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pretrained Transformers such as BERT achieve high accuracy on
in-distribution examples, do they generalize to new distributions? We
systematically measure out-of-distribution (OOD) generalization for seven NLP
datasets by constructing a new robustness benchmark with realistic distribution
shifts. We measure the generalization of previous models including bag-of-words
models, ConvNets, and LSTMs, and we show that pretrained Transformers'
performance declines are substantially smaller. Pretrained transformers are
also more effective at detecting anomalous or OOD examples, while many previous
models are frequently worse than chance. We examine which factors affect
robustness, finding that larger models are not necessarily more robust,
distillation can be harmful, and more diverse pretraining data can enhance
robustness. Finally, we show where future work can improve OOD robustness.
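As a rough illustration of the two evaluations described in the abstract, the sketch below (a minimal sketch, not the paper's released code) computes the ID-to-OOD accuracy decline and an OOD-detection AUROC from a classifier's softmax outputs; scoring inputs by their maximum softmax probability is one common anomaly score and is assumed here purely for illustration.

```python
# Minimal sketch: measuring OOD generalization decline and OOD detection
# from softmax outputs. Assumes numpy arrays of class probabilities and
# integer labels; the max-softmax-probability score is an illustrative choice.
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_generalization_gap(id_probs, id_labels, ood_probs, ood_labels):
    """Accuracy decline when moving from in-distribution to shifted test data."""
    id_acc = float((id_probs.argmax(axis=1) == id_labels).mean())
    ood_acc = float((ood_probs.argmax(axis=1) == ood_labels).mean())
    return id_acc - ood_acc

def msp_ood_detection_auroc(id_probs, ood_probs):
    """AUROC for separating ID from OOD inputs, scoring each input by its
    maximum softmax probability (higher = more confidently in-distribution)."""
    scores = np.concatenate([id_probs.max(axis=1), ood_probs.max(axis=1)])
    is_id = np.concatenate([np.ones(len(id_probs)), np.zeros(len(ood_probs))])
    return roc_auc_score(is_id, scores)  # 0.5 = chance, 1.0 = perfect detection
```

Under this kind of evaluation, a detector that is "worse than chance" corresponds to an AUROC below 0.5, which is the failure mode the abstract attributes to many pre-Transformer models.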
Related papers
- Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders [56.47577824219207]
In this paper, we unveil the hidden costs associated with intrusive fine-tuning techniques.
We introduce a new model reprogramming approach for fine-tuning, which we name Reprogrammer.
Our empirical evidence reveals that Reprogrammer is less intrusive and yields superior downstream models.
arXiv Detail & Related papers (2024-03-16T04:19:48Z)
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision language models.
We show that OOD classification and OOD calibration errors share an upper bound consisting of two terms based on ID data.
Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
- Combining pre-trained Vision Transformers and CIDER for Out Of Domain Detection [0.774971301405295]
Most industrial pipelines rely on pre-trained models, such as CNNs or Vision Transformers, for downstream tasks.
This paper investigates the performance of those models on the task of out-of-domain detection.
arXiv Detail & Related papers (2023-09-06T14:41:55Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, which to our knowledge is the first time this has been demonstrated.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal language models such as the Transformer for the downstream task of generating Failure Analysis Triplets (FATs).
We observe that Generative Pre-trained Transformer 2 (GPT2) outperforms other transformer models on the failure analysis triplet generation (FATG) task.
In particular, we observe that GPT2 (with 1.5B parameters) outperforms pre-trained BERT, BART, and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
- Pretrained Transformers Do not Always Improve Robustness [23.227505403565903]
We show that pretrained Transformers (PT) provide less robust representations than traditional models when exposed to noisy data.
We augment PT with an adversarial filtering (AF) mechanism that has been shown to improve OOD generalization.
However, an increase in generalization does not necessarily increase robustness, as we find that noisy data fools the AF method powered by PT.
arXiv Detail & Related papers (2022-10-14T09:30:36Z)
- Improving Out-of-Distribution Generalization by Adversarial Training with Structured Priors [17.936426699670864]
We show that sample-wise Adversarial Training (AT) yields only limited improvement in Out-of-Distribution (OOD) generalization.
We propose two AT variants with low-rank structures to train OOD-robust models.
Our proposed approaches outperform Empirical Risk Minimization (ERM) and sample-wise AT.
arXiv Detail & Related papers (2022-10-13T07:37:42Z)
- Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency correlates with better average OOD robustness for some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines (a minimal sketch of such a distillation objective appears after this list).
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers? [7.727662147015879]
Task-agnostic forms of data augmentation have proven widely effective in computer vision, even on pretrained models.
We ask how effective these techniques really are when applied to pretrained transformers.
We observe a negative result, finding that techniques which previously reported strong improvements for non-pretrained models fail to consistently improve performance for pretrained transformers.
arXiv Detail & Related papers (2020-10-05T03:55:15Z)
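Returning to the churn-reduction entry above: the following is a minimal sketch, in PyTorch, of distilling against a frozen base model as the teacher, which implicitly constrains predictive churn by pulling the new model toward the base model's predictions while it still fits the labels. The mixing weight and temperature are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch (not the paper's implementation): distillation from a frozen
# base model, which discourages predictions from churning away from the base.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, base_logits, labels, alpha=0.5, temperature=2.0):
    """Cross-entropy on the labels plus a KL term that pulls the student's
    soft predictions toward those of the frozen base model (the teacher)."""
    ce = F.cross_entropy(student_logits, labels)
    soft_teacher = F.softmax(base_logits.detach() / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kl

# Example: a batch of 4 examples with 3 classes.
loss = distillation_loss(torch.randn(4, 3), torch.randn(4, 3), torch.tensor([0, 2, 1, 0]))
```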