Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training
- URL: http://arxiv.org/abs/2204.11218v1
- Date: Sun, 24 Apr 2022 08:42:47 GMT
- Title: Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training
- Authors: Yuanxin Liu, Fandong Meng, Zheng Lin, Peng Fu, Yanan Cao, Weiping
Wang, Jie Zhou
- Abstract summary: Recent studies show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have similar transfer learning performance as the original PLM.
In this paper, we find that the BERT subnetworks have even more potential than these studies have shown.
We train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork.
- Score: 55.43088293183165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained
language models (PLMs) like BERT contain matching subnetworks that have similar
transfer learning performance as the original PLM. These subnetworks are found
using magnitude-based pruning. In this paper, we find that the BERT subnetworks
have even more potential than these studies have shown. Firstly, we discover
that the success of magnitude pruning can be attributed to the preserved
pre-training performance, which correlates with the downstream transferability.
Inspired by this, we propose to directly optimize the subnetwork structure
towards the pre-training objectives, which can better preserve the pre-training
performance. Specifically, we train binary masks over model weights on the
pre-training tasks, with the aim of preserving the universal transferability of
the subnetwork, which is agnostic to any specific downstream tasks. We then
fine-tune the subnetworks on the GLUE benchmark and the SQuAD dataset. The
results show that, compared with magnitude pruning, mask training can
effectively find BERT subnetworks with improved overall performance on
downstream tasks. Moreover, our method is also more efficient in searching
subnetworks and more advantageous when fine-tuning within a certain range of
data scarcity. Our code is available at https://github.com/llyx97/TAMT.
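The abstract's core mechanism, training binary masks over frozen pre-trained weights against the pre-training objective so that the selected subnetwork stays task-agnostic, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' TAMT implementation: it assumes a straight-through estimator over per-weight scores, initializes the scores from weight magnitudes (mirroring the magnitude-pruning baseline), and substitutes a toy regression loss for the actual MLM pre-training loss.

```python
# Minimal sketch of task-agnostic mask training (not the authors' exact TAMT code).
# Assumptions: real-valued scores per weight, binarized in the forward pass by a
# top-k threshold, with gradients passed to the scores via a straight-through
# estimator (STE). The frozen pre-trained weights stay fixed; only the mask scores
# are optimized against a pre-training-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """Binarize mask scores at a sparsity-dependent threshold; pass gradients through."""

    @staticmethod
    def forward(ctx, scores, sparsity):
        # Keep the top-(1 - sparsity) fraction of scores as 1, zero out the rest.
        k = max(1, int(scores.numel() * (1.0 - sparsity)))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: gradient w.r.t. the scores is the incoming gradient.
        return grad_output, None


class MaskedLinear(nn.Module):
    """A frozen linear layer whose weights are gated by a trainable binary mask."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        # Initialize scores from weight magnitudes, so the starting subnetwork
        # coincides with the magnitude-pruning baseline.
        self.scores = nn.Parameter(self.weight.abs().clone())
        self.sparsity = sparsity

    def forward(self, x):
        mask = BinarizeSTE.apply(self.scores, self.sparsity)
        return F.linear(x, self.weight * mask, self.bias)


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = MaskedLinear(nn.Linear(64, 64), sparsity=0.7)
    opt = torch.optim.Adam([layer.scores], lr=1e-2)

    # Stand-in for a pre-training objective (e.g., MLM): a toy regression target,
    # only to show that gradients reach the mask scores through the STE.
    x = torch.randn(128, 64)
    target = torch.randn(128, 64)
    for step in range(100):
        opt.zero_grad()
        loss = F.mse_loss(layer(x), target)
        loss.backward()
        opt.step()
    print(f"final toy pre-training loss: {loss.item():.4f}")
```

In this reading of the method, once the mask is learned it is kept fixed and the surviving weights are fine-tuned on a downstream task, matching the GLUE and SQuAD evaluation protocol described in the abstract.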
Related papers
- One Train for Two Tasks: An Encrypted Traffic Classification Framework
Using Supervised Contrastive Learning [18.63871240173137]
We propose an effective model named Contrastive Learning Enhanced Temporal Fusion (CLE-TFE).
In particular, we utilize supervised contrastive learning to enhance the packet-level and flow-level representations.
We also propose cross-level multi-task learning, which accomplishes both the packet-level and flow-level classification tasks in the same model with a single training run.
arXiv Detail & Related papers (2024-02-12T09:10:09Z) - Continual Learning: Forget-free Winning Subnetworks for Video Representations [75.40220771931132]
The Winning Subnetwork (WSN) is considered in terms of task performance for various continual learning tasks.
It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios.
The use of a Fourier Subneural Operator (FSO) within WSN is considered for Video Incremental Learning (VIL).
arXiv Detail & Related papers (2023-12-19T09:11:49Z) - A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Large-scale pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs also tend to rely on dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that PLMs can be replaced with sparse subnetworks without hurting the performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z) - Dual Lottery Ticket Hypothesis [71.95937879869334]
The Lottery Ticket Hypothesis (LTH) provides a novel view for investigating sparse network training while maintaining model capacity.
In this work, we regard the winning ticket from LTH as a subnetwork in a trainable condition and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our Dual Lottery Ticket Hypothesis (DLTH).
arXiv Detail & Related papers (2022-03-08T18:06:26Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.