Quantized Transformer Language Model Implementations on Edge Devices
- URL: http://arxiv.org/abs/2310.03971v1
- Date: Fri, 6 Oct 2023 01:59:19 GMT
- Title: Quantized Transformer Language Model Implementations on Edge Devices
- Authors: Mohammad Wali Ur Rahman, Murad Mehrab Abrar, Hunter Gibbons Copening,
Salim Hariri, Sicong Shao, Pratik Satam, and Soheil Salehi
- Abstract summary: Large-scale transformer-based models like the Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models, which have millions of parameters, are first pre-trained on a large corpus and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
- Score: 1.2979415757860164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale transformer-based models like the Bidirectional Encoder
Representations from Transformers (BERT) are widely used for Natural Language
Processing (NLP) applications, wherein these models, which have millions of
parameters, are initially pre-trained on a large corpus and then fine-tuned
for a downstream NLP task. One of the major limitations of these large-scale
models
is that they cannot be deployed on resource-constrained devices due to their
large model size and increased inference latency. In order to overcome these
limitations, such large-scale models can be converted to an optimized
FlatBuffer format, tailored for deployment on resource-constrained edge
devices. Herein, we evaluate the performance of such FlatBuffer-transformed
MobileBERT models on three different edge devices, fine-tuned for reputation
analysis of English-language tweets in the RepLab 2013 dataset. In addition,
this study encompassed an evaluation of the deployed models, wherein their
latency, performance, and resource efficiency were meticulously assessed. Our
experiment results show that, compared to the original BERT large model, the
converted and quantized MobileBERT models have 160$\times$ smaller footprints
for a 4.1% drop in accuracy while analyzing at least one tweet per second on
edge devices. Furthermore, our study highlights the privacy-preserving aspect
of TinyML systems as all data is processed locally within a serverless
environment.
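As a concrete illustration of the conversion step described above, the following is a minimal sketch using the TensorFlow Lite converter (whose serialized model format is FlatBuffer), assuming a SavedModel export of the fine-tuned MobileBERT and post-training dynamic-range quantization; the paths and the specific quantization mode are assumptions, not details stated in the abstract.

```python
# Minimal sketch: convert a fine-tuned MobileBERT SavedModel into a quantized
# TensorFlow Lite FlatBuffer. The paths and the choice of post-training
# dynamic-range quantization are assumptions for illustration only.
import tensorflow as tf

SAVED_MODEL_DIR = "mobilebert_replab_savedmodel"  # hypothetical export path
TFLITE_PATH = "mobilebert_replab_quant.tflite"    # hypothetical output path

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
# Post-training dynamic-range quantization: weights are stored as 8-bit
# integers, shrinking the on-disk FlatBuffer footprint.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open(TFLITE_PATH, "wb") as f:
    f.write(tflite_model)

# On-device inference then goes through the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_path=TFLITE_PATH)
interpreter.allocate_tensors()
```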
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU while using only 1/3 of the parameters.
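The summary does not state the exact update rule; the snippet below is only a generic exponential moving average update, shown to illustrate what EMA-based coefficient learning refers to (the decay factor and the function name are hypothetical).

```python
# Illustrative only: a generic exponential moving average (EMA) update of a
# predictor's combination coefficients; the paper's actual learning rule is
# not specified in this summary.
def ema_update(coeffs, new_estimate, beta=0.9):
    """Blend running coefficients with a freshly computed estimate."""
    return [beta * c + (1.0 - beta) * g for c, g in zip(coeffs, new_estimate)]
```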
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Efficient Machine Translation with a BiLSTM-Attention Approach [0.0]
This paper proposes a novel Seq2Seq model aimed at improving translation quality while reducing the storage space required by the model.
The model employs a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture the context information of the input sequence.
Compared to the current mainstream Transformer model, our model achieves superior performance on the WMT14 machine translation dataset.
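As a rough illustration of the building blocks named above, here is a minimal PyTorch sketch of a Bi-LSTM encoder paired with dot-product attention; the vocabulary and hidden sizes are arbitrary assumptions, not the paper's configuration.

```python
# Minimal sketch of a Bi-LSTM encoder with dot-product attention; sizes are
# placeholders and do not reflect the paper's actual Seq2Seq model.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=32000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):              # tokens: (batch, src_len)
        outputs, _ = self.lstm(self.embed(tokens))
        return outputs                      # (batch, src_len, 2 * hidden_dim)

def attend(decoder_state, encoder_outputs):
    # decoder_state: (batch, 2 * hidden_dim)
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)            # (batch, src_len)
    return torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
```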
arXiv Detail & Related papers (2024-10-29T01:12:50Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
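The nesting can be pictured as slicing a shared feed-forward block so that smaller submodels reuse a prefix of the full hidden dimension; the sketch below conveys that idea under assumed dimensions and is not the paper's implementation.

```python
# Sketch of the nested ("matryoshka") feed-forward idea: submodels of
# different sizes share weights by using only a prefix of the FFN hidden
# units. Dimensions and granularities are assumptions, not MatFormer's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    def __init__(self, d_model=512, d_ff_full=2048):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff_full)
        self.w_out = nn.Linear(d_ff_full, d_model)

    def forward(self, x, d_ff_active=2048):
        # Use only the first d_ff_active hidden units; lowering d_ff_active
        # at inference time extracts a smaller submodel from the same weights.
        h = F.relu(F.linear(x, self.w_in.weight[:d_ff_active],
                            self.w_in.bias[:d_ff_active]))
        return F.linear(h, self.w_out.weight[:, :d_ff_active], self.w_out.bias)
```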
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- Extensive Evaluation of Transformer-based Architectures for Adverse Drug Events Extraction [6.78974856327994]
Adverse Event (ADE) extraction is one of the core tasks in digital pharmacovigilance.
We evaluate 19 Transformer-based models for ADE extraction on informal texts.
At the end of our analyses, we identify a list of take-home messages that can be derived from the experimental data.
arXiv Detail & Related papers (2023-06-08T15:25:24Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
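At a high level, the pattern pairs a small drafting model with a large fallback model; the sketch below shows that pattern with a hypothetical model interface and a made-up confidence threshold, and omits BiLD's actual fallback and rollback policies.

```python
# Generic big/little decoding sketch: the small model proposes each token and
# the large model is consulted only when the small model is unsure. The model
# interface and threshold are hypothetical placeholders.
import torch

def big_little_generate(small_model, big_model, prompt_ids, max_len=128,
                        threshold=0.9):
    """small_model / big_model: callables mapping a 1-D LongTensor of token
    ids to a probability vector over the vocabulary (assumed interface)."""
    tokens = list(prompt_ids)
    while len(tokens) < max_len:
        probs = small_model(torch.tensor(tokens))
        conf, token = probs.max(dim=-1)
        if conf.item() < threshold:
            # Low confidence: defer this decoding step to the large model.
            token = big_model(torch.tensor(tokens)).argmax(dim=-1)
        tokens.append(int(token))
    return tokens
```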
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but that capacity also incurs huge computation costs.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
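The idea can be sketched as partitioning a trained FFN's hidden units into expert groups and evaluating only the groups a router selects per token; the router and group sizes below are assumptions, not the paper's construction.

```python
# Sketch of conditional FFN computation in the MoEfication spirit: only the
# top-k expert groups selected by a (hypothetical) router are evaluated.
import torch

def moefied_ffn(x, w_in, w_out, router, top_k=2):
    # x: (d_model,) for a single token, for clarity.
    # w_in:  (num_experts, expert_size, d_model)  -- grouped FFN input weights
    # w_out: (num_experts, d_model, expert_size)  -- grouped FFN output weights
    # router: (num_experts, d_model)              -- assumed expert scorer
    gate = router @ x                         # relevance score per expert
    chosen = gate.topk(top_k).indices         # evaluate only these experts
    out = torch.zeros(w_out.shape[1])
    for e in chosen.tolist():
        h = torch.relu(w_in[e] @ x)           # (expert_size,)
        out = out + w_out[e] @ h              # accumulate expert output
    return out
```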
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Error Detection in Large-Scale Natural Language Understanding Systems Using Transformer Models [0.0]
Large-scale conversational assistants like Alexa, Siri, Cortana and Google Assistant process every utterance using multiple models for domain, intent and named entity recognition.
We address the challenge of detecting domain classification errors using offline Transformer models.
We combine utterance encodings from a RoBERTa model with the N-best hypotheses produced by the production system, and then fine-tune end-to-end in a multitask setting using a small dataset of human-annotated utterances with domain classification errors.
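A minimal sketch of that combination, assuming a fixed-size featurization of the N-best hypotheses and omitting the multitask heads (all dimensions are placeholders):

```python
# Sketch: concatenate a RoBERTa utterance encoding with N-best features and
# classify whether the production domain decision is likely an error.
# Dimensions and the N-best featurization are assumptions.
import torch
import torch.nn as nn

class DomainErrorDetector(nn.Module):
    def __init__(self, roberta_dim=768, nbest_dim=64, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(roberta_dim + nbest_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),             # error vs. no error
        )

    def forward(self, utterance_encoding, nbest_features):
        combined = torch.cat([utterance_encoding, nbest_features], dim=-1)
        return self.classifier(combined)
```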
arXiv Detail & Related papers (2021-09-04T00:10:48Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.