Finding the Winning Ticket of BERT for Binary Text Classification via
Adaptive Layer Truncation before Fine-tuning
- URL: http://arxiv.org/abs/2111.10951v1
- Date: Mon, 22 Nov 2021 02:22:47 GMT
- Title: Finding the Winning Ticket of BERT for Binary Text Classification via
Adaptive Layer Truncation before Fine-tuning
- Authors: Jing Fan, Xin Zhang, Sheng Zhang
- Abstract summary: We construct a series of BERT-based models with different size and compare their predictions on 8 binary classification tasks.
The results show there truly exist smaller sub-networks performing better than the full model.
- Score: 7.797987384189306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In light of the success of transferring language models into NLP tasks, we
ask whether the full BERT model is always the best and does it exist a simple
but effective method to find the winning ticket in state-of-the-art deep neural
networks without complex calculations. We construct a series of BERT-based
models with different size and compare their predictions on 8 binary
classification tasks. The results show there truly exist smaller sub-networks
performing better than the full model. Then we present a further study and
propose a simple method to shrink BERT appropriately before fine-tuning. Some
extended experiments indicate that our method could save time and storage
overhead extraordinarily with little even no accuracy loss.
Related papers
- Breaking the Token Barrier: Chunking and Convolution for Efficient Long
Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum token limit of 512 tokens. Consequently, this makes it non-trivial to apply it in a practical setting with long input.
We propose a relatively simple extension to vanilla BERT architecture called ChunkBERT that allows finetuning of any pretrained models to perform inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z) - DPBERT: Efficient Inference for BERT based on Dynamic Planning [11.680840266488884]
Existing input-adaptive inference methods fail to take full advantage of the structure of BERT.
We propose Dynamic Planning in BERT, a novel fine-tuning strategy that can accelerate the inference process of BERT.
Our method reduces latency to 75% while maintaining 98% accuracy, yielding a better accuracy-speed trade-off compared to state-of-the-art input-adaptive methods.
arXiv Detail & Related papers (2023-07-26T07:18:50Z) - Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high quality representations for previously learned tasks in a long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z) - Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training [55.43088293183165]
Recent studies show that pre-trained language models (PLMs) like BERT contain matchingworks that have similar transfer learning performance as the original PLM.
In this paper, we find that the BERTworks have even more potential than these studies have shown.
We train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork.
arXiv Detail & Related papers (2022-04-24T08:42:47Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - BERTVision -- A Parameter-Efficient Approach for Question Answering [0.0]
We present a highly parameter efficient approach for Question Answering that significantly reduces the need for extended BERT fine-tuning.
Our method uses information from the hidden state activations of each BERT transformer layer, which is discarded during typical BERT inference.
Our experiments show that this approach works well not only for span QA, but also for classification, suggesting that it may be to a wider range of tasks.
arXiv Detail & Related papers (2022-02-24T17:16:25Z) - BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is hard to be trained directly than a ternary counterpart due to its complex and irregular loss landscapes.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
arXiv Detail & Related papers (2020-12-31T16:34:54Z) - A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in the NLP field showed that transfer learning helps with achieving state-of-the-art results for new tasks by tuning pre-trained models instead of starting from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly higher results than a BERT model for a small dataset and these simple models get trained in much less time than tuning the pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matchingworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferrableworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z) - DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.