Learning Dynamic BERT via Trainable Gate Variables and a Bi-modal Regularizer
- URL: http://arxiv.org/abs/2102.09727v1
- Date: Fri, 19 Feb 2021 03:59:23 GMT
- Title: Learning Dynamic BERT via Trainable Gate Variables and a Bi-modal Regularizer
- Authors: Seohyeong Jeong, Nojun Kwak
- Abstract summary: The BERT model has shown significant success on various natural language processing tasks.
Due to its heavy model size and high computational cost, the model suffers from high latency, which is fatal to deployment on resource-limited devices.
We propose a dynamic inference method on BERT via trainable gate variables applied to input tokens and a regularizer that has a bi-modal property.
- Score: 36.74058297640735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The BERT model has shown significant success on various natural language
processing tasks. However, due to its heavy model size and high computational
cost, the model suffers from high latency, which is fatal to deployment on
resource-limited devices. To tackle this problem, we propose a dynamic
inference method on BERT via trainable gate variables applied on input tokens
and a regularizer that has a bi-modal property. Our method shows reduced
computational cost on the GLUE dataset with a minimal performance drop.
Moreover, the model can adjust the trade-off between performance and
computational cost via a user-specified hyperparameter.
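The abstract does not spell out the exact form of the gates or the regularizer, but the mechanism it describes can be sketched in a few lines of PyTorch. In the sketch below, the sigmoid gating, the linear scorer, and the g(1-g) bi-modal penalty are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Per-token scalar gates in [0, 1]; tokens whose gates fall toward 0
    can be dropped at inference time. A sketch, not the authors' code."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Assumed parameterization: a linear scorer over token states.
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        gates = torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)
        # Soft gating during training; inference would threshold the gates
        # and skip computation for dropped tokens entirely.
        return hidden_states * gates.unsqueeze(-1), gates

def bimodal_regularizer(gates: torch.Tensor) -> torch.Tensor:
    # g * (1 - g) peaks at g = 0.5 and vanishes at g in {0, 1}, so
    # minimizing it pushes every gate toward one of the two modes.
    return (gates * (1.0 - gates)).mean()

def total_loss(task_loss, gates, lam_cost=0.1, lam_bimodal=1.0):
    # lam_cost stands in for the user-specified hyperparameter trading
    # performance against computational cost (an assumption about the knob).
    return task_loss + lam_cost * gates.mean() + lam_bimodal * bimodal_regularizer(gates)
```

Minimizing g(1-g) drives each gate toward 0 (drop the token) or 1 (keep it), which keeps the hard token dropping used at inference consistent with the soft gates used during training.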
Related papers
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters [102.1116808722299]
We introduce TokenFormer, a natively scalable Transformer architecture.
By treating model parameters as tokens, we replace all the linear projections in Transformers with token-parameter attention layers.
Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs.
arXiv Detail & Related papers (2024-10-30T16:19:00Z)
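For readers curious how "treating model parameters as tokens" can replace a linear projection, here is a minimal PyTorch sketch of TokenFormer-style token-parameter attention. The plain scaled softmax (the paper uses a modified normalization) and the zero-initialized growth step are simplifications:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenParameterAttention(nn.Module):
    """A linear projection replaced by attention between input tokens
    (queries) and a learnable set of key-value parameter tokens."""

    def __init__(self, dim: int, num_param_tokens: int):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); attend over the parameter tokens.
        scores = x @ self.param_keys.t() / x.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ self.param_values

    @torch.no_grad()
    def grow(self, extra: int):
        # Incremental scaling: append new key-value parameter pairs while
        # keeping the trained ones intact (zero init is an assumption).
        dim = self.param_keys.shape[1]
        new = lambda: torch.zeros(extra, dim, device=self.param_keys.device)
        self.param_keys = nn.Parameter(torch.cat([self.param_keys, new()]))
        self.param_values = nn.Parameter(torch.cat([self.param_values, new()]))
```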
- Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models [18.877891285367216]
A class of parameter-efficient fine-tuning (PEFT) methods aims to mitigate computational challenges by selectively fine-tuning only a small fraction of the model parameters.
We introduce $\text{ID}^3$, a novel selective PEFT method that calculates parameter importance continually and dynamically unmasks parameters.
We analytically show that $\text{ID}^3$ reduces the number of gradient updates by a factor of two, enhancing computational efficiency.
arXiv Detail & Related papers (2024-08-26T17:58:53Z)
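A hedged sketch of the incremental-unmasking idea behind $\text{ID}^3$: all parameters start frozen, and the currently most important frozen entries are progressively made trainable. The |w * grad| importance proxy and the top-k schedule below are stand-ins for the paper's actual criterion:

```python
import torch

def init_masks(model):
    # Everything starts frozen: mask 0 = frozen, 1 = trainable.
    return {n: torch.zeros_like(p) for n, p in model.named_parameters()}

def unmask_step(model, masks, k_per_step: int):
    """After a backward pass, promote the k most important still-frozen
    parameters to trainable."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            score = (p * p.grad).abs() * (1.0 - masks[name])  # frozen only
            k = min(k_per_step, int((masks[name] == 0).sum()))
            if k > 0:
                idx = torch.topk(score.view(-1), k).indices
                masks[name].view(-1)[idx] = 1.0

def mask_gradients(model, masks):
    # Zero the gradients of frozen parameters so the optimizer only
    # touches the currently unmasked fraction.
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name])
```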
- Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models [109.06052781040916]
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z)
- DPBERT: Efficient Inference for BERT based on Dynamic Planning [11.680840266488884]
Existing input-adaptive inference methods fail to take full advantage of the structure of BERT.
We propose Dynamic Planning in BERT, a novel fine-tuning strategy that can accelerate the inference process of BERT.
Our method reduces latency to 75% of the original while maintaining 98% accuracy, yielding a better accuracy-speed trade-off than state-of-the-art input-adaptive methods.
arXiv Detail & Related papers (2023-07-26T07:18:50Z)
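The DPBERT abstract describes selecting which BERT layers to execute per input but not the selector itself; the following is a speculative PyTorch sketch of one way such dynamic planning could be wired up, with the planner head, threshold, and batch-level routing all being assumptions:

```python
import torch
import torch.nn as nn

class LayerPlanner(nn.Module):
    """A tiny head that scores every Transformer layer from the [CLS]
    state; low-scoring layers are bypassed through the residual stream."""

    def __init__(self, hidden_size: int, num_layers: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, num_layers)
        self.threshold = threshold

    def forward(self, cls_state: torch.Tensor) -> torch.Tensor:
        # cls_state: (batch, hidden) -> per-layer keep probabilities.
        return torch.sigmoid(self.scorer(cls_state))

def plan_and_run(layers, hidden, planner):
    keep = planner(hidden[:, 0])          # plan once from the [CLS] token
    for i, layer in enumerate(layers):
        # Batch-level decision for simplicity; true per-example routing
        # would gather and scatter examples instead.
        if keep[:, i].mean() >= planner.threshold:
            hidden = layer(hidden)        # execute this block
        # else: bypass the block entirely (identity shortcut)
    return hidden
```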
- Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT [6.029590006321152]
We present Sensi-BERT, a sensitivity-driven, efficient fine-tuning of BERT models for downstream tasks.
Our experiments show the efficacy of Sensi-BERT across different downstream tasks including MNLI, QQP, QNLI, SST-2 and SQuAD.
arXiv Detail & Related papers (2023-07-14T17:24:15Z)
- Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF).
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z)
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness.
Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context.
Our reference implementation achieves up to a $2\times$ increase in inference throughput and even greater memory savings.
arXiv Detail & Related papers (2023-05-25T07:39:41Z)
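A minimal sketch of the learnable context-dropping mechanism described above, assuming a per-token scorer over cached states and a hard keep/drop threshold (both illustrative choices, not the paper's exact formulation):

```python
import torch
import torch.nn as nn

class ContextPruner(nn.Module):
    """Scores cached tokens and marks uninformative ones for eviction."""

    def __init__(self, dim: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)
        self.threshold = threshold

    def forward(self, cached_states: torch.Tensor) -> torch.Tensor:
        # cached_states: (batch, ctx_len, dim) -> boolean keep mask.
        keep_prob = torch.sigmoid(self.scorer(cached_states)).squeeze(-1)
        return keep_prob > self.threshold

def attention_bias(keep_mask: torch.Tensor) -> torch.Tensor:
    # Pruned positions get -inf bias so they contribute nothing to
    # attention; memory savings come from actually evicting those
    # entries from the KV cache rather than just masking them.
    bias = torch.zeros(keep_mask.shape, dtype=torch.float32)
    return bias.masked_fill(~keep_mask, float("-inf")).unsqueeze(1)
```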
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
We propose adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
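To make the Mixture-of-Experts idea concrete, here is a generic top-1-routed MoE feed-forward block in PyTorch. It illustrates why inference gets faster, since each token only pays for one small expert; MoEBERT's importance-guided adaptation of a trained dense FFN into experts is not shown:

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """The dense FFN is replaced by several smaller experts plus a router;
    each token pays only for its single chosen expert, so capacity grows
    while per-token inference cost shrinks."""

    def __init__(self, dim: int, expert_dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, expert_dim), nn.GELU(),
                          nn.Linear(expert_dim, dim))
            for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); top-1 routing per token.
        expert_idx = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = expert_idx == e
            if chosen.any():
                out[chosen] = expert(x[chosen])
        return out
```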
- DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference [3.375478015832455]
We propose DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models.
DACT-BERT adds an adaptive computational mechanism to BERT's regular processing pipeline, which controls the number of Transformer blocks that need to be executed at inference time.
Our experiments demonstrate that, compared to the baselines, our approach excels in a reduced computational regime and is competitive in less restrictive ones.
arXiv Detail & Related papers (2021-09-24T04:45:55Z)
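A rough sketch of adaptive computation time over Transformer blocks in the spirit of DACT-BERT: a small halting head accumulates confidence after each block and the loop exits early once the batch is confident. DACT's fully differentiable combination of intermediate outputs is simplified into plain ACT-style early stopping here:

```python
import torch
import torch.nn as nn

class AdaptiveDepth(nn.Module):
    """After each block, a halting head emits a confidence; once the
    accumulated halting mass crosses the threshold, remaining blocks
    are skipped."""

    def __init__(self, hidden_size: int, threshold: float = 0.99):
        super().__init__()
        self.halt_head = nn.Linear(hidden_size, 1)
        self.threshold = threshold

    def forward(self, layers, hidden: torch.Tensor) -> torch.Tensor:
        halted = torch.zeros(hidden.shape[0], device=hidden.device)
        for layer in layers:
            hidden = layer(hidden)
            p_halt = torch.sigmoid(self.halt_head(hidden[:, 0])).squeeze(-1)
            halted = halted + (1.0 - halted) * p_halt  # accumulate mass
            if bool((halted > self.threshold).all()):  # batch is confident
                break                                  # skip later blocks
        return hidden
```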