Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training
- URL: http://arxiv.org/abs/2204.11218v1
- Date: Sun, 24 Apr 2022 08:42:47 GMT
- Title: Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training
- Authors: Yuanxin Liu, Fandong Meng, Zheng Lin, Peng Fu, Yanan Cao, Weiping
Wang, Jie Zhou
- Abstract summary: Recent studies show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have similar transfer learning performance as the original PLM.
In this paper, we find that the BERT subnetworks have even more potential than these studies have shown.
We train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork.
- Score: 55.43088293183165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained
language models (PLMs) like BERT contain matching subnetworks that have similar
transfer learning performance as the original PLM. These subnetworks are found
using magnitude-based pruning. In this paper, we find that the BERT subnetworks
have even more potential than these studies have shown. Firstly, we discover
that the success of magnitude pruning can be attributed to the preserved
pre-training performance, which correlates with the downstream transferability.
Inspired by this, we propose to directly optimize the subnetwork structure
towards the pre-training objectives, which can better preserve the pre-training
performance. Specifically, we train binary masks over model weights on the
pre-training tasks, with the aim of preserving the universal transferability of
the subnetwork, which is agnostic to any specific downstream tasks. We then
fine-tune the subnetworks on the GLUE benchmark and the SQuAD dataset. The
results show that, compared with magnitude pruning, mask training can
effectively find BERT subnetworks with improved overall performance on
downstream tasks. Moreover, our method is also more efficient in searching
subnetworks and more advantageous when fine-tuning within a certain range of
data scarcity. Our code is available at https://github.com/llyx97/TAMT.
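The abstract's core mechanism, training binary masks over frozen pre-trained weights against the pre-training objective so that the selected subnetwork stays task-agnostic, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' TAMT implementation: it assumes a straight-through estimator over per-weight scores, initializes the scores from weight magnitudes (mirroring the magnitude-pruning baseline), and substitutes a toy regression loss for the actual MLM pre-training loss.

```python
# Minimal sketch of task-agnostic mask training (not the authors' exact TAMT code).
# Assumptions: real-valued scores per weight, binarized in the forward pass by a
# top-k threshold, with gradients passed to the scores via a straight-through
# estimator (STE). The frozen pre-trained weights stay fixed; only the mask scores
# are optimized against a pre-training-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """Binarize mask scores at a sparsity-dependent threshold; pass gradients through."""

    @staticmethod
    def forward(ctx, scores, sparsity):
        # Keep the top-(1 - sparsity) fraction of scores as 1, zero out the rest.
        k = max(1, int(scores.numel() * (1.0 - sparsity)))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: gradient w.r.t. the scores is the incoming gradient.
        return grad_output, None


class MaskedLinear(nn.Module):
    """A frozen linear layer whose weights are gated by a trainable binary mask."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        # Initialize scores from weight magnitudes, so the starting subnetwork
        # coincides with the magnitude-pruning baseline.
        self.scores = nn.Parameter(self.weight.abs().clone())
        self.sparsity = sparsity

    def forward(self, x):
        mask = BinarizeSTE.apply(self.scores, self.sparsity)
        return F.linear(x, self.weight * mask, self.bias)


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = MaskedLinear(nn.Linear(64, 64), sparsity=0.7)
    opt = torch.optim.Adam([layer.scores], lr=1e-2)

    # Stand-in for a pre-training objective (e.g., MLM): a toy regression target,
    # only to show that gradients reach the mask scores through the STE.
    x = torch.randn(128, 64)
    target = torch.randn(128, 64)
    for step in range(100):
        opt.zero_grad()
        loss = F.mse_loss(layer(x), target)
        loss.backward()
        opt.step()
    print(f"final toy pre-training loss: {loss.item():.4f}")
```

In this reading of the method, once the mask is learned it is kept fixed and the surviving weights are fine-tuned on a downstream task, matching the GLUE and SQuAD evaluation protocol described in the abstract.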
Related papers
- One Train for Two Tasks: An Encrypted Traffic Classification Framework
Using Supervised Contrastive Learning [18.63871240173137]
We propose an effective model named Contrastive Learning Enhanced Temporal Fusion (CLE-TFE).
In particular, we utilize supervised contrastive learning to enhance the packet-level and flow-level representations.
We also propose cross-level multi-task learning, which accomplishes both the packet-level and flow-level classification tasks in the same model with a single training run.
arXiv Detail & Related papers (2024-02-12T09:10:09Z) - Continual Learning: Forget-free Winning Subnetworks for Video Representations [75.40220771931132]
The Winning Subnetwork (WSN) is considered in terms of task performance for various continual learning tasks.
It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios.
The use of a Fourier Subneural Operator (FSO) within WSN is considered for Video Incremental Learning (VIL).
arXiv Detail & Related papers (2023-12-19T09:11:49Z) - A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Large-scale pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs also tend to rely on dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that PLMs can be replaced with sparse subnetworks without hurting the performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z) - Dual Lottery Ticket Hypothesis [71.95937879869334]
The Lottery Ticket Hypothesis (LTH) provides a novel view for investigating sparse network training while maintaining model capacity.
In this work, we regard the winning ticket from LTH as a subnetwork in a trainable condition and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our Dual Lottery Ticket Hypothesis (DLTH).
arXiv Detail & Related papers (2022-03-08T18:06:26Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.