Binarized Neural Machine Translation
- URL: http://arxiv.org/abs/2302.04907v1
- Date: Thu, 9 Feb 2023 19:27:34 GMT
- Title: Binarized Neural Machine Translation
- Authors: Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat
- Abstract summary: We propose a novel binarization technique for Transformers applied to machine translation (BMT).
We identify and address the problem of inflated dot-product variance when using one-bit weights and activations.
Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size.
- Score: 43.488431560851204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid scaling of language models is motivating research using
low-bitwidth quantization. In this work, we propose a novel binarization
technique for Transformers applied to machine translation (BMT), the first of
its kind. We identify and address the problem of inflated dot-product variance
when using one-bit weights and activations. Specifically, BMT leverages
additional LayerNorms and residual connections to improve binarization quality.
Experiments on the WMT dataset show that a one-bit weight-only Transformer can
achieve the same quality as a float one, while being 16x smaller in size.
One-bit activations incur varying degrees of quality drop, but these are mitigated by the
proposed architectural changes. We further conduct a scaling law study using
production-scale translation datasets, which shows that one-bit weight
Transformers scale and generalize well in both in-domain and out-of-domain
settings. Implementation in JAX/Flax will be open sourced.
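As a rough illustration of the core idea, here is a minimal JAX sketch, not the authors' released JAX/Flax implementation, of one-bit weight binarization via a straight-through estimator, with a LayerNorm applied to the binary dot product to rein in its inflated variance, loosely mirroring the extra LayerNorms the abstract describes. The helper names (`binarize_ste`, `binary_dense`) and the per-layer float scale `alpha` are illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def binarize_ste(w):
    """Forward: sign(w) (about {-1, +1} for nonzero w). Backward: identity (straight-through)."""
    return w + jax.lax.stop_gradient(jnp.sign(w) - w)


def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned scale/shift for brevity."""
    mean = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)


def binary_dense(x, w, alpha):
    """Dense layer with one-bit weights (hypothetical sketch).

    Each term of the binary dot product is +-|x_i|, so the raw output has
    inflated variance; normalizing it before applying the float scale `alpha`
    keeps downstream activations in a workable range.
    """
    y = x @ binarize_ste(w)
    return alpha * layer_norm(y)


# Tiny usage example with random data.
kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (4, 512))            # batch of float activations
w = jax.random.normal(kw, (512, 512)) * 0.02   # latent float weights kept for training
print(binary_dense(x, w, alpha=1.0).shape)     # (4, 512)
```

This corresponds to the weight-only setting the abstract reports: activations `x` stay in float while only `w` is binarized; binarizing activations as well would need a further quantizer on `x` and, per the abstract, costs some quality.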
Related papers
- Efficient Machine Translation with a BiLSTM-Attention Approach [0.0]
This paper proposes a novel Seq2Seq model aimed at improving translation quality while reducing the storage space required by the model.
The model employs a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture the context information of the input sequence.
Compared to the current mainstream Transformer model, our model achieves superior performance on the WMT14 machine translation dataset.
arXiv Detail & Related papers (2024-10-29T01:12:50Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models are pre-trained on a large corpus, contain millions of parameters, and are then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z)
- TranSFormer: Slow-Fast Transformer for Machine Translation [52.12212173775029]
We present a Slow-Fast two-stream learning model, referred to as TranSFormer.
Our TranSFormer shows consistent BLEU improvements (larger than 1 BLEU point) on several machine translation benchmarks.
arXiv Detail & Related papers (2023-05-26T14:37:38Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- What's Hidden in a One-layer Randomly Weighted Transformer? [100.98342094831334]
Hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance.
Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14.
arXiv Detail & Related papers (2021-09-08T21:22:52Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation [9.770173256808844]
We propose a mixed precision quantization strategy to represent Transformer weights by an extremely low number of bits (a generic low-bit weight-quantization sketch appears after this list).
Our model is 11.8x smaller than the baseline model, with a BLEU degradation of less than 0.5.
We achieve an 8.3x reduction in run-time memory footprint and a 3.5x speedup.
arXiv Detail & Related papers (2020-09-16T03:58:01Z)
- Variational Neural Machine Translation with Normalizing Flows [13.537869825364718]
Variational Neural Machine Translation (VNMT) is an attractive framework for modeling the generation of target translations.
We propose to apply the VNMT framework to the state-of-the-art Transformer and introduce a more flexible approximate posterior based on normalizing flows.
arXiv Detail & Related papers (2020-05-28T13:30:53Z)
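Related to the low-bit quantization entries above, the following is a minimal sketch, under stated assumptions rather than any cited paper's actual mixed-precision scheme, of per-tensor k-bit uniform (absmax) weight quantization, the basic operation behind such extreme weight compression; `quantize_kbit` and `dequantize` are hypothetical helpers.

```python
import jax.numpy as jnp


def quantize_kbit(w, bits=2):
    """Quantize a float weight tensor to signed `bits`-bit integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 1 for 2-bit, 3 for 3-bit
    scale = jnp.max(jnp.abs(w)) / qmax      # per-tensor absmax scale
    codes = jnp.clip(jnp.round(w / scale), -qmax, qmax).astype(jnp.int8)
    return codes, scale


def dequantize(codes, scale):
    """Recover an approximate float tensor from the integer codes."""
    return codes.astype(jnp.float32) * scale


# Tiny usage example.
w = jnp.array([[0.31, -0.12], [-0.48, 0.05]])
codes, scale = quantize_kbit(w, bits=3)
print(codes)                     # int8 codes in [-3, 3]
print(dequantize(codes, scale))  # approximate reconstruction of w
```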