BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials
- URL: http://arxiv.org/abs/2312.08937v2
- Date: Thu, 20 Jun 2024 05:19:54 GMT
- Title: BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials
- Authors: Xingrun Xing, Li Du, Xinyuan Wang, Xianlin Zeng, Yequan Wang, Zheng Zhang, Jiajun Zhang
- Abstract summary: This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks.
BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs).
Extensive experiments validate the effectiveness of BiPFTs, which surpass task-specific baselines by 15.4% in average performance on the GLUE benchmark.
- Score: 27.573329030086676
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pretrained foundation models offer substantial benefits for a wide range of downstream tasks, making them one of the most promising routes toward artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge brings computational challenges, especially on resource-limited devices such as mobile phones. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which saves 56x operations and 28x memory. In contrast to previous task-specific binary transformers, BiPFT substantially enhances the learning capabilities of binary neural networks (BNNs), promoting BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of this error. To simulate full-precision self-attention, we define the binarization error as binarization residual polynomials, and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, which surpass task-specific baselines by 15.4% in average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation, which consequently generalizes across various NLU tasks and simplifies the downstream pipeline of BNNs. Our code and pretrained models are publicly available at https://github.com/Xingrun-Xing/BiPFT.
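To make the abstract's key idea concrete, here is a minimal, hypothetical sketch of a 1-bit linear operation whose binarization error is compensated by a learned low-rank term. Everything in it (the class name, the straight-through binarizer, the rank hyperparameter, the LoRA-style initialization) is an assumption for illustration; the paper itself derives binarization residual polynomials specifically for self-attention operations and models those with low-rank estimators.

```python
import torch
import torch.nn as nn

class BinaryLinearWithLowRankResidual(nn.Module):
    """Hypothetical sketch: a 1-bit linear layer whose binarization error is
    approximated by a learned low-rank correction, loosely following the idea
    of modeling binarization residuals with low-rank estimators."""

    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Low-rank factors U @ V approximate the residual between binary and
        # full-precision outputs. LoRA-style init: U starts at zero so the
        # correction is initially inactive but remains trainable.
        self.U = nn.Parameter(torch.zeros(out_features, rank))
        self.V = nn.Parameter(torch.randn(rank, in_features) * 0.01)

    @staticmethod
    def binarize(x):
        # Sign binarization with a straight-through estimator: the forward
        # pass uses sign(x); the backward pass lets gradients flow unchanged.
        return x + (torch.sign(x) - x).detach()

    def forward(self, x):
        w_bin = self.binarize(self.weight)
        x_bin = self.binarize(x)
        # 1-bit main path (XNOR/popcount-friendly on real binary hardware).
        out = nn.functional.linear(x_bin, w_bin)
        # Cheap full-precision low-rank term estimating the binarization residual.
        return out + x @ self.V.t() @ self.U.t()
```

The low-rank term adds only O(rank * (in + out)) full-precision work per token, which is why such estimators can be affordable on top of a binary main path.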
Related papers
- BiDense: Binarization for Dense Prediction [62.70804353158387]
BiDense is a generalized binary neural network (BNN) designed for efficient and accurate dense prediction tasks.
BiDense incorporates two key techniques: the Distribution-adaptive Binarizer (DAB) and the Channel-adaptive Full-precision Bypass (CFB).
arXiv Detail & Related papers (2024-11-15T16:46:04Z)
- Efficient Multitask Dense Predictor via Binarization [19.5100813204537]
We introduce network binarization to compress resource-intensive multi-task dense predictors.
We propose a Binary Multi-task Dense Predictor, Bi-MTDP, and several variants of Bi-MTDP.
One variant of Bi-MTDP outperforms full-precision (FP) multi-task dense prediction SoTAs, ARTC (CNN-based) and InvPT (ViT-based).
arXiv Detail & Related papers (2024-05-23T03:19:23Z)
- Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients [51.82488018573326]
We present QP-SBGD, a novel layer-wise optimiser tailored towards training neural networks with binary weights.
BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy.
Our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware.
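As a rough classical illustration of the projected-SGD idea (not QP-SBGD itself, whose projection step is cast as a QUBO and solved on quantum annealing hardware), one step might look like the following; the function name and learning rate are assumptions:

```python
import torch

def projected_sgd_step(latent_w, grad, lr=1e-2):
    """Sketch of one projected SGD step for binary weights: descend on a
    latent real-valued copy of the weights, then project back onto the
    binary set {-1, +1}. The sign() projection here is only a classical
    stand-in for QP-SBGD's quantum-annealed projection."""
    latent_w = latent_w - lr * grad        # gradient step in the relaxed space
    return latent_w, torch.sign(latent_w)  # keep the latent copy + its projection
```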
arXiv Detail & Related papers (2023-10-23T17:32:38Z)
- Input Layer Binarization with Bit-Plane Encoding [4.872439392746007]
We present a new method to binarize the first layer by directly using the 8-bit representation of the input data.
The resulting model is fully binarized, and the first-layer binarization approach is model-independent.
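A minimal sketch of what bit-plane encoding of an 8-bit input could look like, assuming the general scheme rather than the paper's exact formulation:

```python
import torch

def bit_plane_encode(img_u8):
    """Split each 8-bit pixel into 8 binary planes, so the first layer
    consumes only {0, 1} inputs and can itself be binarized."""
    planes = [(img_u8 >> b) & 1 for b in range(8)]  # LSB first
    return torch.stack(planes, dim=0).float()
```

For an (H, W) uint8 image this yields an (8, H, W) tensor of zeros and ones, which can be remapped to {-1, +1} if the binary convolution expects signs.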
arXiv Detail & Related papers (2023-05-04T14:49:07Z)
- Decouple Graph Neural Networks: Train Multiple Simple GNNs Simultaneously Instead of One [60.5818387068983]
Graph neural networks (GNNs) suffer from severe inefficiency.
We propose to decouple a multi-layer GNN as multiple simple modules for more efficient training.
We show that the proposed framework is highly efficient with reasonable performance.
arXiv Detail & Related papers (2023-04-20T07:21:32Z)
- Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
BNNs neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
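For context, here is a sketch of the bilinear weight-scale coupling in question, using the well-known XNOR-Net-style closed-form scale; RBONN's recurrent optimization of this relationship is not reproduced here:

```python
import torch

def bilinear_binarize(weight):
    """Sketch of the bilinear parameterization underlying scaled BNNs:
    W is approximated as alpha * sign(W), where the scale alpha and the
    binary codes are coupled (bilinear) variables. The closed-form
    XNOR-Net-style alpha is shown; RBONN instead optimizes the coupling
    recurrently during training."""
    # weight: (out_channels, in_features); per-output-channel scale
    alpha = weight.abs().mean(dim=1, keepdim=True)
    return alpha * torch.sign(weight)
```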
arXiv Detail & Related papers (2022-09-04T06:45:33Z)
- Network Binarization via Contrastive Learning [16.274341164897827]
We establish a novel contrastive learning framework while training Binary Neural Networks (BNNs).
Mutual information (MI) is introduced as the metric to measure the information shared between binary and full-precision (FP) activations.
Results show that our method can be implemented as a pile-up module on existing state-of-the-art binarization methods.
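A hypothetical InfoNCE-style sketch of such a contrastive objective between binary and FP activations (a common tractable proxy for maximizing mutual information; the function name and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def binary_fp_contrastive_loss(z_bin, z_fp, temperature=0.1):
    """Treat each (binary, full-precision) activation pair from the same
    sample as a positive and all other pairs in the batch as negatives --
    a standard tractable lower bound on their mutual information."""
    z_bin = F.normalize(z_bin, dim=1)
    z_fp = F.normalize(z_fp, dim=1)
    logits = z_bin @ z_fp.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z_bin.size(0), device=z_bin.device)
    return F.cross_entropy(logits, targets)          # positives on the diagonal
```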
arXiv Detail & Related papers (2022-07-06T21:04:53Z)
- BiT: Robustly Binarized Multi-distilled Transformer [36.06192421902272]
These approaches allow, for the first time, fully binarized transformer models at a practical level of accuracy, approaching a full-precision BERT baseline to within as little as 5.9%.
arXiv Detail & Related papers (2022-05-25T19:01:54Z)
- Bias-Variance Tradeoffs in Single-Sample Binary Gradient Estimators [100.58924375509659]
The straight-through (ST) estimator gained popularity due to its simplicity and efficiency.
Several techniques were proposed to improve over ST while keeping the same low computational complexity.
We conduct a theoretical analysis of the bias and variance of these methods to understand their tradeoffs and verify the originally claimed properties.
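For reference, a minimal PyTorch implementation of the clipped ST estimator for sign binarization:

```python
import torch

class SignST(torch.autograd.Function):
    """Straight-through (ST) estimator: the forward pass applies the
    non-differentiable sign(); the backward pass replaces its
    zero-almost-everywhere derivative with a clipped identity."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients only where |x| <= 1 (the common "clipped" ST variant).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)
```

Usage: `x_bin = SignST.apply(x)`; the clipping is exactly the bias-inducing choice whose tradeoffs the paper analyzes.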
arXiv Detail & Related papers (2021-10-07T15:16:07Z)
- A Meta-Learning Approach to the Optimal Power Flow Problem Under Topology Reconfigurations [69.73803123972297]
We propose a DNN-based OPF predictor that is trained using a meta-learning (MTL) approach.
The developed OPF predictor is validated through simulations using benchmark IEEE bus systems.
arXiv Detail & Related papers (2020-12-21T17:39:51Z)