Related papers: Binary and Ternary Natural Language Generation

Binary and Ternary Natural Language Generation

URL: http://arxiv.org/abs/2306.01841v1
Date: Fri, 2 Jun 2023 18:01:02 GMT
Title: Binary and Ternary Natural Language Generation
Authors: Zechun Liu, Barlas Oguz, Aasish Pappu, Yangyang Shi, Raghuraman Krishnamoorthi
Abstract summary: Ternary and binary neural networks enable multiplication-free computation. They promise multiple orders of magnitude efficiency gains over full-precision networks. However, such networks have proven very difficult to optimize. We show first ternary and binary transformer models on the downstream tasks of summarization and machine translation.
Score: 24.295815261826153
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full precision mBART model score of 26.8. We also compare our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer

Related papers

Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors. We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks. We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning [90.02873747873444]
Bit Diffusion is a generic approach for generating discrete data with continuous diffusion models. The proposed approach can achieve strong performance in both discrete image generation and image captioning tasks. For image captioning on MS-COCO dataset, our approach achieves competitive results compared to autoregressive models.
arXiv Detail & Related papers (2022-08-08T15:08:40Z)
Soft Threshold Ternary Networks [36.722958963130665]
In previous ternarized neural networks, a hard threshold Delta is introduced to determine quantization intervals. We present the Soft Threshold Ternary Networks (STTN), which enables the model to automatically determine quantization intervals. Our method dramatically outperforms current state-of-the-arts, lowering the performance gap between full-precision networks and extreme low bit networks.
arXiv Detail & Related papers (2022-04-04T04:43:42Z)
8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values. This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters. In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
Binarization Methods for Motor-Imagery Brain-Computer Interface Classification [18.722731794073756]
We propose methods for transforming real-valued weights to binary numbers for efficient inference. By tuning the dimension of the binary embedding, we achieve almost the same accuracy in 4-class MI ($leq$1.27% lower) compared to models with float16 weights. Our method replaces the fully connected layer of CNNs with a binary augmented memory using bipolar random projection.
arXiv Detail & Related papers (2020-10-14T12:28:18Z)
Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that after a principled modification on the Transformer architecture, dubbed Transformer, an (almost) fully 8-bit integer inference algorithm could be derived. Our experiments on WMT16 En->Ro, WMT14 En->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline but requires nearly 4x less memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z)
FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts. In this work, we show that, under some mild constraints, computational complexity of the ternary inner product can be reduced by a factor of 2. We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR) We propose the convolution-augmented transformer for speech recognition, named Conformer. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.