BiT: Robustly Binarized Multi-distilled Transformer
- URL: http://arxiv.org/abs/2205.13016v1
- Date: Wed, 25 May 2022 19:01:54 GMT
- Title: BiT: Robustly Binarized Multi-distilled Transformer
- Authors: Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li,
Raghuraman Krishnamoorthi, Yashar Mehdad
- Abstract summary: We develop fully binarized transformer models that reach a practical level of accuracy, coming within as little as 5.9% of a full-precision BERT baseline.
- Score: 36.06192421902272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern pre-trained transformers have rapidly advanced the state-of-the-art in
machine learning, but have also grown in parameters and computational
complexity, making them increasingly difficult to deploy in
resource-constrained environments. Binarization of the weights and activations
of the network can significantly alleviate these issues, however is technically
challenging from an optimization perspective. In this work, we identify a
series of improvements which enables binary transformers at a much higher
accuracy than what was possible previously. These include a two-set
binarization scheme, a novel elastic binary activation function with learned
parameters, and a method to quantize a network to its limit by successively
distilling higher precision models into lower precision students. These
approaches enable, for the first time, fully binarized transformer models at a
practical level of accuracy, coming within as little as 5.9% of a full-precision
BERT baseline on the GLUE language understanding benchmark.
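To make the elastic binary activation concrete, here is a minimal sketch, assuming PyTorch, of how such a function could be written: a hard {0, alpha} binarization whose scale alpha and threshold beta are learned, with gradients passed through a straight-through estimator. The class name, initialization values, and exact gradient handling are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class ElasticBinaryActivation(nn.Module):
    # Sketch of an elastic binary activation (illustrative, not the paper's code).
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable scale (assumed init)
        self.beta = nn.Parameter(torch.tensor(0.0))   # learnable threshold (assumed init)

    def forward(self, x):
        alpha = self.alpha.clamp(min=1e-6)
        # Shift by the learned threshold, normalize by the learned scale, clip to [0, 1].
        soft = ((x - self.beta) / alpha).clamp(0.0, 1.0)
        hard = torch.round(soft)  # binarize to {0, 1}
        # Straight-through estimator: the forward pass uses the hard value,
        # the backward pass routes gradients through the clipped soft value.
        binary = soft + (hard - soft).detach()
        return alpha * binary  # outputs lie in {0, alpha}

# Gradients reach alpha and beta even though the forward pass is binary.
act = ElasticBinaryActivation()
act(torch.randn(4, 16)).sum().backward()

In deployments of binarized networks, such learned scales are commonly folded into adjacent layers at inference time, so the activation itself contributes only a one-bit decision per element.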
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials [27.573329030086676]
This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks.
BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs).
Extensive experiments validate the effectiveness of BiPFTs, which surpass task-specific baselines by an average of 15.4% on the GLUE benchmark.
arXiv Detail & Related papers (2023-12-14T13:42:57Z) - Partial Tensorized Transformers for Natural Language Processing [0.0]
We study tensor-train decomposition as a way to compress vision and language neural networks, namely BERT and ViT, while improving their accuracy.
Our novel PTNN approach significantly improves the accuracy of existing models by up to 5%, all without the need for post-training adjustments.
arXiv Detail & Related papers (2023-10-30T23:19:06Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics exhibit a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision [45.69716658698776]
In this paper, we attribute the difficulty of low-bit quantization-aware training for transformers to their unique variation behaviors.
We propose a variation-aware quantization scheme for both vision and language transformers.
Our solution substantially improves 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% and 1.4%, respectively.
arXiv Detail & Related papers (2023-07-01T13:01:39Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Understanding and Overcoming the Challenges of Efficient Transformer Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format.
arXiv Detail & Related papers (2021-09-27T10:57:18Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart; a generic sketch of the underlying linear-attention recurrence follows this list.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
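For the "Finetuning Pretrained Transformers into RNNs" entry above, the linear-complexity recurrent variant can be pictured as linear attention: replacing the softmax with a kernel feature map lets the growing context be folded into a fixed-size running state, so each generated token costs constant memory. The sketch below, assuming PyTorch, illustrates that generic recurrence; the feature map (elu + 1), function names, and tensor shapes are assumptions and not that paper's exact conversion method.

import torch
import torch.nn.functional as F

def feature_map(x):
    # A common positive feature map used in linear attention (an assumed choice).
    return F.elu(x) + 1.0

def recurrent_linear_attention(q, k, v):
    # q, k, v: (seq_len, dim). Outputs are computed step by step, the way an
    # RNN would during autoregressive decoding.
    seq_len, dim = q.shape
    S = torch.zeros(dim, v.shape[-1])  # running sum of phi(k_t) v_t^T
    z = torch.zeros(dim)               # running sum of phi(k_t)
    outputs = []
    for t in range(seq_len):
        qt, kt, vt = feature_map(q[t]), feature_map(k[t]), v[t]
        S = S + torch.outer(kt, vt)
        z = z + kt
        out = (qt @ S) / (qt @ z + 1e-6)  # normalized attention output for step t
        outputs.append(out)
    return torch.stack(outputs)

q = k = v = torch.randn(10, 64)
print(recurrent_linear_attention(q, k, v).shape)  # torch.Size([10, 64])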
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.