The MultiBERTs: BERT Reproductions for Robustness Analysis
- URL: http://arxiv.org/abs/2106.16163v1
- Date: Wed, 30 Jun 2021 15:56:44 GMT
- Title: The MultiBERTs: BERT Reproductions for Robustness Analysis
- Authors: Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander
D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan
Das, Ian Tenney, Ellie Pavlick
- Abstract summary: Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
- Score: 86.29162676103385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experiments with pretrained models such as BERT are often based on a single
checkpoint. While the conclusions drawn apply to the artifact (i.e., the
particular instance of the model), it is not always clear whether they hold for
the more general procedure (which includes the model architecture, training
data, initialization scheme, and loss function). Recent work has shown that
re-running pretraining can lead to substantially different conclusions about
performance, suggesting that alternative evaluations are needed to make
principled statements about procedures. To address this issue, we introduce
MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters
similar to those of the original BERT model but differing in random
initialization and data shuffling. The aim is to enable researchers to draw
robust and statistically justified conclusions about pretraining procedures.
The full release includes 25 fully trained checkpoints, as well as statistical
guidelines and a code library implementing our recommended hypothesis testing
methods. Finally, for five of these models we release a set of 28 intermediate
checkpoints in order to support research on learning dynamics.
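The release pairs the checkpoints with statistical guidance for comparing pretraining procedures across seeds rather than from a single artifact. As a rough illustration only, the Python sketch below runs a seed-level paired bootstrap on per-seed downstream scores; it is not the released MultiBERTs library (which also accounts for uncertainty over test examples), and the function name and simulated accuracies are assumptions made for the example.
```python
# A minimal sketch, assuming per-seed downstream scores are already available.
# NOT the released MultiBERTs library; it only resamples pretraining seeds.
import numpy as np


def paired_seed_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the null hypothesis that pretraining
    procedures A and B perform equally, given one score per pretraining seed."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    assert scores_a.shape == scores_b.shape, "scores must be paired by seed"
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    observed = diffs.mean()
    centered = diffs - observed            # resample under the null (mean 0)
    n = len(diffs)
    boot_means = np.array([
        centered[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])
    p_value = float(np.mean(np.abs(boot_means) >= abs(observed)))
    return observed, p_value


if __name__ == "__main__":
    # Simulated GLUE-style accuracies for 25 checkpoints of each procedure
    # (illustrative numbers, not results from the paper).
    rng = np.random.default_rng(1)
    proc_a = 0.840 + 0.010 * rng.standard_normal(25)
    proc_b = 0.835 + 0.010 * rng.standard_normal(25)
    delta, p = paired_seed_bootstrap(proc_a, proc_b)
    print(f"mean difference = {delta:.4f}, bootstrap p-value = {p:.3f}")
```
Resampling over pretraining seeds, rather than over fine-tuning runs of a single checkpoint, is what lets such a comparison speak to the procedure rather than to one artifact.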
Related papers
- Bag of Lies: Robustness in Continuous Pre-training BERT [2.4850657856181946]
This study aims to gain deeper insight into the continuous pre-training phase of BERT with respect to entity knowledge.
Because BERT's pre-training data has not been updated since its original release, the model has little to no entity knowledge about COVID-19.
We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID.
arXiv Detail & Related papers (2024-06-14T12:16:08Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- Autoregressive Score Generation for Multi-trait Essay Scoring [8.531986117865946]
We propose an autoregressive prediction of multi-trait scores (ArTS) in automated essay scoring (AES)
Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores.
Experimental results demonstrate the efficacy of ArTS, showing average improvements of over 5% across both prompts and traits.
arXiv Detail & Related papers (2024-03-13T08:34:53Z)
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptation (TTA) methods have been proposed to address this issue.
In this work, we adopt a Non-Parametric Classifier to perform test-time Adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [143.14128737978342]
Test-time adaptation, an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Recent progress in this paradigm highlights the significant benefits of utilizing unlabeled data for training self-adapted models prior to inference.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials (a rough sketch of this kind of analysis appears after this list).
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
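The last entry above quantifies best-found performance as a function of the number of fine-tuning trials. As a hedged illustration of that kind of analysis, and not code from that paper, the sketch below estimates the expected best validation score after k fine-tuning runs by resampling from a pool of per-seed scores; the scores are simulated and the helper name is an assumption.
```python
# A minimal sketch, assuming a pool of validation scores from many fine-tuning
# runs that differ only in random seed; the scores below are simulated, and
# expected_max_at_k is an illustrative helper, not the paper's code.
import numpy as np


def expected_max_at_k(scores, k, n_samples=10_000, seed=0):
    """Monte-Carlo estimate of the expected best score among k runs drawn
    (with replacement) from the observed pool of per-seed scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    draws = rng.choice(scores, size=(n_samples, k), replace=True)
    return float(draws.max(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # 100 simulated validation accuracies, varying only with the fine-tuning seed.
    runs = np.clip(0.88 + 0.02 * rng.standard_normal(100), 0.0, 1.0)
    for k in (1, 5, 10, 25, 50):
        print(f"expected best of {k:>2} trials: {expected_max_at_k(runs, k):.4f}")
```
Plotting the expected best score against k makes explicit how much of a reported gain can come simply from running more seeds.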
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.