The MultiBERTs: BERT Reproductions for Robustness Analysis
- URL: http://arxiv.org/abs/2106.16163v1
- Date: Wed, 30 Jun 2021 15:56:44 GMT
- Title: The MultiBERTs: BERT Reproductions for Robustness Analysis
- Authors: Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander
D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan
Das, Ian Tenney, Ellie Pavlick
- Abstract summary: Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
- Score: 86.29162676103385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experiments with pretrained models such as BERT are often based on a single
checkpoint. While the conclusions drawn apply to the artifact (i.e., the
particular instance of the model), it is not always clear whether they hold for
the more general procedure (which includes the model architecture, training
data, initialization scheme, and loss function). Recent work has shown that
re-running pretraining can lead to substantially different conclusions about
performance, suggesting that alternative evaluations are needed to make
principled statements about procedures. To address this question, we introduce
MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters
similar to those of the original BERT model but differing in random
initialization and data shuffling. The aim is to enable researchers to draw
robust and statistically justified conclusions about pretraining procedures.
The full release includes 25 fully trained checkpoints, as well as statistical
guidelines and a code library implementing our recommended hypothesis testing
methods. Finally, for five of these models we release a set of 28 intermediate
checkpoints in order to support research on learning dynamics.
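
The release's statistical guidelines treat the pretraining seed, rather than a single checkpoint, as the unit of analysis when comparing procedures. As a rough illustration of that idea (this is a hedged sketch, not the paper's released library or its exact multi-bootstrap procedure), the Python snippet below resamples per-seed evaluation scores to put a confidence interval on the difference between two procedures; the function name and the scores are hypothetical.

```python
# Illustrative sketch only: a simple bootstrap over pretraining seeds,
# in the spirit of the statistical guidelines described in the abstract.
# All names and numbers here are hypothetical, not the paper's API.
import numpy as np

def seed_bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap the mean score difference between two pretraining
    procedures, resampling over per-seed evaluation scores.

    scores_a, scores_b: 1-D arrays, one score per pretraining seed
    (e.g., each of the 25 MultiBERTs checkpoints fine-tuned once).
    Returns the observed difference and a 95% percentile interval.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = scores_a.mean() - scores_b.mean()

    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample seeds with replacement, independently per procedure.
        a = rng.choice(scores_a, size=scores_a.size, replace=True)
        b = rng.choice(scores_b, size=scores_b.size, replace=True)
        diffs[i] = a.mean() - b.mean()

    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return observed, (lower, upper)

# Hypothetical per-seed scores for two pretraining procedures.
proc_a = [84.1, 84.5, 83.9, 84.8, 84.2]
proc_b = [83.7, 84.0, 83.5, 84.1, 83.8]
diff, ci = seed_bootstrap_diff(proc_a, proc_b)
print(f"mean difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

With 25 seeds per procedure the same resampling idea applies directly; the paper's library additionally accounts for variation over evaluation examples, which this sketch ignores.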
Related papers
- AsserT5: Test Assertion Generation Using a Fine-Tuned Code Language Model [8.995812770349602]
We propose AsserT5, a new model based on the pre-trained CodeT5 model.
We find that the abstraction and the inclusion of the focal method are also useful for a fine-tuned pre-trained model.
arXiv Detail & Related papers (2025-02-04T20:42:22Z)
- Test-Time Alignment via Hypothesis Reweighting [56.71167047381817]
Large pretrained models often struggle with underspecified tasks.
We propose a novel framework to address the challenge of aligning models to test-time user intent.
arXiv Detail & Related papers (2024-12-11T23:02:26Z)
- Bag of Lies: Robustness in Continuous Pre-training BERT [2.4850657856181946]
This study aims to gain further insight into how continuous pre-training affects BERT's entity knowledge.
Because BERT's pre-training data was last updated before the pandemic, the model has little to no entity knowledge about COVID-19.
We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID.
arXiv Detail & Related papers (2024-06-14T12:16:08Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [117.72709110877939]
Test-time adaptation (TTA) has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
We categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.