schuBERT: Optimizing Elements of BERT
- URL: http://arxiv.org/abs/2005.06628v1
- Date: Sat, 9 May 2020 21:56:04 GMT
- Title: schuBERT: Optimizing Elements of BERT
- Authors: Ashish Khetan, Zohar Karnin
- Abstract summary: We revisit the architecture choices of BERT in an effort to obtain a lighter model.
We show that much more efficient, lighter BERT models can be obtained by reducing the correct, algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives $6.6\%$ higher average accuracy on the GLUE and SQuAD datasets than BERT with three encoder layers.
- Score: 22.463154358632472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers \citep{vaswani2017attention} have gradually become a key
component of many state-of-the-art natural language representation models. A
recent Transformer-based model, BERT \citep{devlin2018bert}, achieved
state-of-the-art results on various natural language processing tasks,
including GLUE, SQuAD v1.1, and SQuAD v2.0. This model, however, is
computationally prohibitive and has a huge number of parameters. In this work
we revisit the architecture choices of BERT in an effort to obtain a lighter
model. We focus on reducing the number of parameters, yet our methods can be
applied to other objectives such as FLOPs or latency. We show that much more
efficient, lighter BERT models can be obtained by reducing the correct,
algorithmically chosen architecture design dimensions rather than the number of
Transformer encoder layers. In particular, our schuBERT gives $6.6\%$ higher
average accuracy on the GLUE and SQuAD datasets than BERT with three encoder
layers while having the same number of parameters.
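To make the distinction concrete, below is a minimal sketch (not the authors' code) contrasting the two ways of shrinking BERT discussed in the abstract: cutting encoder layers versus shrinking the per-layer design dimensions (hidden size, feed-forward width). It assumes the Hugging Face `transformers` library, and the specific dimension values are illustrative placeholders rather than the schuBERT-optimized ones.

```python
# Sketch: compare parameter budgets of a shallow BERT vs. a "slim" BERT that keeps
# all layers but shrinks per-layer design dimensions. Values are illustrative only.
from transformers import BertConfig, BertModel

def count_params(config: BertConfig) -> int:
    """Build a randomly initialized BERT from the config and count its parameters."""
    return BertModel(config).num_parameters()

# Baseline BERT-base: 12 layers, hidden size 768, 12 heads, feed-forward size 3072.
base = BertConfig()

# Option A: keep the per-layer dimensions and cut the depth to 3 encoder layers.
shallow = BertConfig(num_hidden_layers=3)

# Option B (the direction schuBERT argues for): keep all 12 layers and shrink the
# per-layer design dimensions instead. These values are hypothetical placeholders,
# not the dimensions selected by the schuBERT procedure.
slim = BertConfig(hidden_size=432, num_attention_heads=12, intermediate_size=1728)

for name, cfg in [("BERT-base", base), ("3-layer BERT", shallow), ("slim 12-layer BERT", slim)]:
    print(f"{name}: {count_params(cfg) / 1e6:.1f}M parameters")
```

Running the script prints the parameter count of each configuration, so the two reduction strategies can be compared at similar parameter budgets.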
Related papers
- SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers [88.68985153780514]
We propose a new selective PEFT method, namely SparseGrad, that performs well on parameter blocks.
We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task.
arXiv Detail & Related papers (2024-10-09T19:03:52Z)
- Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT [6.029590006321152]
We present Sensi-BERT, a sensitivity driven efficient fine-tuning of BERT models for downstream tasks.
Our experiments show the efficacy of Sensi-BERT across different downstream tasks including MNLI, QQP, QNLI, SST-2 and SQuAD.
arXiv Detail & Related papers (2023-07-14T17:24:15Z)
- Block-wise Bit-Compression of Transformer-based Models [9.77519365079468]
We propose BBCT, a block-wise bit-compression method for Transformers that requires no retraining.
Our results on the General Language Understanding Evaluation (GLUE) benchmark show that BBCT incurs less than a 1% accuracy drop on most tasks.
arXiv Detail & Related papers (2023-03-16T09:53:57Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show that the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves performance comparable to the teacher model on most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can conduct quantization and pruning simultaneously at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
- Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z)
- AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models [46.69439585453071]
We adopt one-shot Neural Architecture Search (NAS) to automatically search for architecture hyper-parameters.
Specifically, we design the one-shot learning technique and the search space to provide an adaptive and efficient way of developing tiny PLMs.
We name our method AutoTinyBERT and evaluate its effectiveness on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2021-07-29T00:47:30Z)
- IOT: Instance-wise Layer Reordering for Transformer Structures [173.39918590438245]
We break the assumption of the fixed layer order in the Transformer and introduce instance-wise layer reordering into the model structure.
Our method can also be applied to other architectures beyond Transformer.
arXiv Detail & Related papers (2021-03-05T03:44:42Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference; a minimal sketch of the early-exit idea appears after this list.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
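As referenced in the DeeBERT entry above, the following is a minimal, self-contained sketch of the entropy-based early-exit idea (assuming PyTorch; this is not the DeeBERT release code): attach a small classifier, or "off-ramp", to every encoder layer and stop as soon as the prediction entropy of the current off-ramp falls below a threshold. The layer sizes, threshold, and random input are illustrative.

```python
# Sketch: entropy-based early exit over a stack of Transformer encoder layers.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy encoder with one classifier ("off-ramp") per layer for early exiting."""

    def __init__(self, hidden_size=768, num_heads=12, num_layers=12,
                 num_labels=2, entropy_threshold=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.ramps = nn.ModuleList([nn.Linear(hidden_size, num_labels) for _ in range(num_layers)])
        self.entropy_threshold = entropy_threshold

    def forward(self, hidden):
        # hidden: (batch=1, seq_len, hidden_size); batch size 1 keeps the exit rule per-example.
        for depth, (layer, ramp) in enumerate(zip(self.layers, self.ramps), start=1):
            hidden = layer(hidden)
            probs = ramp(hidden[:, 0]).softmax(dim=-1)                # classify the first token
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # prediction uncertainty
            if entropy.item() < self.entropy_threshold:               # confident enough: exit now
                return probs, depth
        return probs, depth                                           # fell through all layers

model = EarlyExitEncoder().eval()
with torch.no_grad():
    probs, exit_layer = model(torch.randn(1, 16, 768))
print(f"exited at layer {exit_layer}; class probabilities {probs.squeeze(0).tolist()}")
```

With trained off-ramps, easy inputs exit at shallow layers while hard inputs continue deeper, which is the mechanism behind the inference-time savings reported in the entry above.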
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.