The MultiBERTs: BERT Reproductions for Robustness Analysis
- URL: http://arxiv.org/abs/2106.16163v1
- Date: Wed, 30 Jun 2021 15:56:44 GMT
- Title: The MultiBERTs: BERT Reproductions for Robustness Analysis
- Authors: Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander
D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan
Das, Ian Tenney, Ellie Pavlick
- Abstract summary: Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
- Score: 86.29162676103385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experiments with pretrained models such as BERT are often based on a single
checkpoint. While the conclusions drawn apply to the artifact (i.e., the
particular instance of the model), it is not always clear whether they hold for
the more general procedure (which includes the model architecture, training
data, initialization scheme, and loss function). Recent work has shown that
re-running pretraining can lead to substantially different conclusions about
performance, suggesting that alternative evaluations are needed to make
principled statements about procedures. To address this issue, we introduce
MultiBERTs: a set of 25 BERT-base checkpoints, trained with hyper-parameters
similar to those of the original BERT model but differing in random
initialization and data shuffling. The aim is to enable researchers to draw
robust and statistically justified conclusions about pretraining procedures.
The full release includes 25 fully trained checkpoints, as well as statistical
guidelines and a code library implementing our recommended hypothesis testing
methods. Finally, for five of these models we release a set of 28 intermediate
checkpoints in order to support research on learning dynamics.
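The release pairs the checkpoints with statistical guidance for comparing pretraining procedures across seeds rather than from a single artifact. As a rough illustration only, the Python sketch below runs a seed-level paired bootstrap on per-seed downstream scores; it is not the released MultiBERTs library (which also accounts for uncertainty over test examples), and the function name and simulated accuracies are assumptions made for the example.
```python
# A minimal sketch, assuming per-seed downstream scores are already available.
# NOT the released MultiBERTs library; it only resamples pretraining seeds.
import numpy as np


def paired_seed_bootstrap(scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided bootstrap p-value for the null hypothesis that pretraining
    procedures A and B perform equally, given one score per pretraining seed."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    assert scores_a.shape == scores_b.shape, "scores must be paired by seed"
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    observed = diffs.mean()
    centered = diffs - observed            # resample under the null (mean 0)
    n = len(diffs)
    boot_means = np.array([
        centered[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])
    p_value = float(np.mean(np.abs(boot_means) >= abs(observed)))
    return observed, p_value


if __name__ == "__main__":
    # Simulated GLUE-style accuracies for 25 checkpoints of each procedure
    # (illustrative numbers, not results from the paper).
    rng = np.random.default_rng(1)
    proc_a = 0.840 + 0.010 * rng.standard_normal(25)
    proc_b = 0.835 + 0.010 * rng.standard_normal(25)
    delta, p = paired_seed_bootstrap(proc_a, proc_b)
    print(f"mean difference = {delta:.4f}, bootstrap p-value = {p:.3f}")
```
Resampling over pretraining seeds, rather than over fine-tuning runs of a single checkpoint, is what lets such a comparison speak to the procedure rather than to one artifact.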
Related papers
- Bag of Lies: Robustness in Continuous Pre-training BERT [2.4850657856181946]
This study aims to gain deeper insight into the continuous pre-training phase of BERT with respect to entity knowledge.
Because BERT's pre-training data has not been updated since its original release, the model has little to no entity knowledge about COVID-19.
We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID.
arXiv Detail & Related papers (2024-06-14T12:16:08Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- Autoregressive Score Generation for Multi-trait Essay Scoring [8.531986117865946]
We propose an autoregressive prediction of multi-trait scores (ArTS) in automated essay scoring (AES)
Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores.
Experimental results demonstrate the efficacy of ArTS, showing average improvements of over 5% across both prompts and traits.
arXiv Detail & Related papers (2024-03-13T08:34:53Z)
- AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptation (TTA) methods have been proposed to address this issue.
In this work, we adopt a Non-Parametric Classifier to perform test-time Adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [143.14128737978342]
Test-time adaptation, an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Recent progress in this paradigm highlights the significant benefits of utilizing unlabeled data for training self-adapted models prior to inference.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials (a rough sketch of this kind of analysis appears after this list).
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
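The last entry above quantifies best-found performance as a function of the number of fine-tuning trials. As a hedged illustration of that kind of analysis, and not code from that paper, the sketch below estimates the expected best validation score after k fine-tuning runs by resampling from a pool of per-seed scores; the scores are simulated and the helper name is an assumption.
```python
# A minimal sketch, assuming a pool of validation scores from many fine-tuning
# runs that differ only in random seed; the scores below are simulated, and
# expected_max_at_k is an illustrative helper, not the paper's code.
import numpy as np


def expected_max_at_k(scores, k, n_samples=10_000, seed=0):
    """Monte-Carlo estimate of the expected best score among k runs drawn
    (with replacement) from the observed pool of per-seed scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    draws = rng.choice(scores, size=(n_samples, k), replace=True)
    return float(draws.max(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # 100 simulated validation accuracies, varying only with the fine-tuning seed.
    runs = np.clip(0.88 + 0.02 * rng.standard_normal(100), 0.0, 1.0)
    for k in (1, 5, 10, 25, 50):
        print(f"expected best of {k:>2} trials: {expected_max_at_k(runs, k):.4f}")
```
Plotting the expected best score against k makes explicit how much of a reported gain can come simply from running more seeds.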
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.