The Lottery Ticket Hypothesis for Pre-trained BERT Networks
- URL: http://arxiv.org/abs/2007.12223v2
- Date: Sun, 18 Oct 2020 20:10:29 GMT
- Title: The Lottery Ticket Hypothesis for Pre-trained BERT Networks
- Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang,
Zhangyang Wang, Michael Carbin
- Abstract summary: In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models.
- Score: 137.99328302234338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In natural language processing (NLP), enormous pre-trained models like BERT
have become the standard starting point for training on a range of downstream
tasks, and similar trends are emerging in other areas of deep learning. In
parallel, work on the lottery ticket hypothesis has shown that models for NLP
and computer vision contain smaller matching subnetworks capable of training in
isolation to full accuracy and transferring to other tasks. In this work, we
combine these observations to assess whether such trainable, transferrable
subnetworks exist in pre-trained BERT models. For a range of downstream tasks,
we indeed find matching subnetworks at 40% to 90% sparsity. We find these
subnetworks at (pre-trained) initialization, a deviation from prior NLP
research where they emerge only after some amount of training. Subnetworks
found on the masked language modeling task (the same task used to pre-train the
model) transfer universally; those found on other tasks transfer in a limited
fashion if at all. As large-scale pre-training becomes an increasingly central
paradigm in deep learning, our results demonstrate that the main lottery ticket
observations remain relevant in this context. Code is available at
https://github.com/VITA-Group/BERT-Tickets.
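The subnetworks described in the abstract are found with an iterative-magnitude-pruning (IMP) style procedure from the lottery ticket literature. The sketch below shows what a single prune-and-rewind round can look like on a pre-trained BERT; it assumes the HuggingFace `transformers` and `torch` packages, and the 40% sparsity target, the restriction to the encoder's Linear layers, and the placeholder fine-tuning step are illustrative choices rather than the paper's exact recipe (the linked repository is the authoritative implementation).

```python
# Minimal sketch of one IMP round on a pre-trained BERT.
# Assumes `torch` and HuggingFace `transformers`; the 40% sparsity and the
# choice to prune only the encoder's Linear layers are illustrative.
import torch
from torch.nn.utils import prune
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Weight matrices of the encoder's Linear layers are the pruning targets.
targets = [
    (m, "weight")
    for m in model.bert.encoder.modules()
    if isinstance(m, torch.nn.Linear)
]

# Snapshot the pre-trained values so surviving weights can be rewound later.
pretrained = {id(m): m.weight.detach().clone() for m, _ in targets}

# ... fine-tune `model` on a downstream task (or on masked LM) here ...

# Globally remove the 40% of targeted weights with the smallest magnitude.
prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=0.4)

# Rewind surviving weights to their pre-trained values while keeping the mask.
# `weight_orig` and `weight_mask` are attributes added by torch's pruning API;
# on the next forward pass, each layer's effective weight is mask * weight_orig.
with torch.no_grad():
    for m, _ in targets:
        m.weight_orig.copy_(pretrained[id(m)])
```

Lottery ticket studies typically repeat this prune-and-rewind loop, removing a fraction of the remaining weights each round, to reach higher sparsities such as the 40% to 90% range reported above; the sketch shows a single round only.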
Related papers
- Data-Efficient Double-Win Lottery Tickets from Robust Pre-training [129.85939347733387]
We introduce Double-Win Lottery Tickets, in which a subnetwork from a pre-trained model can be independently transferred to diverse downstream tasks.
We find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts.
arXiv Detail & Related papers (2022-06-09T20:52:50Z)
- Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training [55.43088293183165]
Recent studies show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have similar transfer learning performance as the original PLM.
In this paper, we find that the BERT subnetworks have even more potential than these studies have shown.
We train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork (a rough sketch of the mask-training idea follows this entry).
arXiv Detail & Related papers (2022-04-24T08:42:47Z)
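To make the mask-training idea in the preceding entry more concrete, here is a rough, self-contained sketch of learning a binary mask over a frozen pre-trained weight matrix with a straight-through estimator; the `MaskedLinear` wrapper, the score initialization, the zero threshold, and the placeholder objective are illustrative assumptions rather than the cited paper's actual method.

```python
# Rough sketch of training a binary mask over frozen pre-trained weights
# (illustrative; not the cited paper's exact procedure).
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, pretrained_linear: nn.Linear):
        super().__init__()
        # Freeze the pre-trained weights and bias; only mask scores are trained.
        self.weight = nn.Parameter(pretrained_linear.weight.detach().clone(),
                                   requires_grad=False)
        self.bias = nn.Parameter(pretrained_linear.bias.detach().clone(),
                                 requires_grad=False)
        # Positive initial scores make the mask start fully dense.
        self.scores = nn.Parameter(torch.full_like(self.weight, 0.01))

    def forward(self, x):
        hard = (self.scores > 0).float()    # binary mask used in the forward pass
        soft = torch.sigmoid(self.scores)   # differentiable surrogate
        mask = hard + soft - soft.detach()  # straight-through estimator
        return nn.functional.linear(x, self.weight * mask, self.bias)

# Usage: wrap a (stand-in) pre-trained layer and train only the mask scores
# on a pre-training-style objective, then reuse the learned mask downstream.
layer = MaskedLinear(nn.Linear(768, 768))
opt = torch.optim.Adam([layer.scores], lr=1e-3)
x = torch.randn(4, 768)
loss = layer(x).pow(2).mean()               # placeholder objective
loss.backward()
opt.step()
```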
- Dual Lottery Ticket Hypothesis [71.95937879869334]
The Lottery Ticket Hypothesis (LTH) provides a novel view for investigating sparse network training while maintaining network capacity.
In this work, we regard the winning ticket from LTH as a subnetwork that is in a trainable condition, and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our Dual Lottery Ticket Hypothesis (DLTH).
arXiv Detail & Related papers (2022-03-08T18:06:26Z)
- Playing Lottery Tickets with Vision and Language [62.6420670250559]
Large-scale transformer-based pre-training has revolutionized vision-and-language (V+L) research.
In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that can achieve performance on par with or even better than the dense networks when trained in isolation.
We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments.
arXiv Detail & Related papers (2021-04-23T22:24:33Z)
- The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models [115.49214555402567]
Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation.
Recent studies suggest that pre-training benefits from gigantic model capacity.
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
arXiv Detail & Related papers (2020-12-12T21:53:55Z)