Real-Time Execution of Large-scale Language Models on Mobile
- URL: http://arxiv.org/abs/2009.06823v2
- Date: Thu, 22 Oct 2020 17:53:07 GMT
- Title: Real-Time Execution of Large-scale Language Models on Mobile
- Authors: Wei Niu, Zhenglun Kong, Geng Yuan, Weiwen Jiang, Jiexiong Guan, Caiwen
Ding, Pu Zhao, Sijia Liu, Bin Ren, Yanzhi Wang
- Abstract summary: We find the best model structure of BERT for a given computation size to match specific devices.
Our framework guarantees that the identified model meets both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
- Score: 49.32610509282623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained large-scale language models have increasingly demonstrated high
accuracy on many natural language processing (NLP) tasks. However, the limited
weight storage and computational speed on hardware platforms have impeded the
popularity of pre-trained models, especially in the era of edge computing. In
this paper, we seek to find the best model structure of BERT for a given
computation size to match specific devices. We propose the first compiler-aware
neural architecture optimization framework. Our framework can guarantee that the
identified model meets both resource and real-time specifications of mobile
devices, thus achieving real-time execution of large transformer-based models
like BERT variants. We evaluate our model on several NLP tasks, achieving
competitive results on well-known benchmarks with lower latency on mobile
devices. Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU
with 0.5-2% accuracy loss compared with BERT-base. Our overall framework
achieves up to 7.8x speedup compared with TensorFlow-Lite with only minor
accuracy loss.
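To make the constraint concrete: the framework searches for a BERT structure whose predicted on-device latency stays within a real-time budget. The sketch below is a hypothetical illustration of that selection step; the candidate grid, the toy latency model predict_latency_ms, and the capacity proxy are invented stand-ins, not the paper's measured, compiler-aware predictor.

```python
# Hypothetical sketch of latency-constrained architecture selection.
# The latency model and scoring proxy are illustrative stand-ins, not
# the paper's actual compiler-aware framework.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class BertConfig:
    num_layers: int
    hidden_size: int
    num_heads: int

def predict_latency_ms(cfg: BertConfig) -> float:
    """Toy latency model: latency grows linearly with depth and
    quadratically with hidden size (placeholder for a measured,
    compiler-aware latency predictor)."""
    return 0.08 * cfg.num_layers * (cfg.hidden_size / 64) ** 2

def capacity_proxy(cfg: BertConfig) -> float:
    """Toy accuracy proxy: larger models score higher."""
    return cfg.num_layers * cfg.hidden_size

def search(budget_ms: float) -> BertConfig:
    candidates = (
        BertConfig(l, h, a)
        for l, h, a in product([4, 6, 8, 12], [256, 384, 512, 768], [4, 8, 12])
        if h % a == 0
    )
    # Keep only configurations that meet the real-time budget, then
    # pick the one with the highest capacity proxy.
    feasible = [c for c in candidates if predict_latency_ms(c) <= budget_ms]
    return max(feasible, key=capacity_proxy)

if __name__ == "__main__":
    best = search(budget_ms=45.0)
    print(best, f"~{predict_latency_ms(best):.1f} ms")
```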
Related papers
- Quantized Transformer Language Model Implementations on Edge Devices [1.2979415757860164]
Large-scale transformer-based models like Bidirectional Encoder Representations from Transformers (BERT) are widely used for Natural Language Processing (NLP) applications.
These models, with millions of parameters, are first pre-trained on a large corpus and then fine-tuned for a downstream NLP task.
One of the major limitations of these large-scale models is that they cannot be deployed on resource-constrained devices due to their large model size and increased inference latency.
arXiv Detail & Related papers (2023-10-06T01:59:19Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and reduce latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
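The big-little idea above can be sketched as a decoding loop in which a cheap draft model emits tokens and defers to the large model when unsure. The stub models and the single confidence threshold below are illustrative assumptions, not BiLD's actual fallback and rollback policies.

```python
# Rough sketch of a big-little decoding loop: a small model drafts tokens
# and hands control to a large model when its confidence drops. The stub
# models and the simple threshold rule are illustrative only.
import random
random.seed(0)

def small_model_step(prefix):
    """Stub: return (token, confidence) from a cheap draft model."""
    return f"tok{len(prefix)}", random.random()

def large_model_step(prefix):
    """Stub: return a token from the expensive, accurate model."""
    return f"TOK{len(prefix)}"

def big_little_decode(max_len=10, fallback_threshold=0.3):
    out, big_calls = [], 0
    while len(out) < max_len:
        token, conf = small_model_step(out)
        if conf < fallback_threshold:
            # Low confidence: defer this step to the large model.
            token = large_model_step(out)
            big_calls += 1
        out.append(token)
    print(f"large-model calls: {big_calls}/{max_len}")
    return out

print(big_little_decode())
```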
- Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime [57.5143536744084]
High performance of deep learning models comes at the expense of high computational, storage and power requirements.
We introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms.
arXiv Detail & Related papers (2022-07-18T15:05:17Z)
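As a reminder of what ultra-low bit quantization does to a weight tensor, here is a minimal symmetric k-bit quantizer; the rounding scheme is the generic textbook one, not Deeplite's proprietary method.

```python
# Minimal symmetric k-bit quantization of a weight tensor (generic
# illustration; not Deeplite's actual quantization scheme).
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 2):
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2-bit: levels {-1, 0, 1}
    scale = np.abs(w).max() / qmax      # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(w, bits=2)
print("max error:", np.abs(w - dequantize(q, s)).max())
```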
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
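A Mixture-of-Experts layer raises capacity without raising per-token compute by routing each token to only one (or a few) of several expert feed-forward networks. The top-1 routing sketch below is generic; MoEBERT's importance-guided expert initialization and distillation are not reproduced here.

```python
# Generic top-1 Mixture-of-Experts feed-forward layer in NumPy.
# Illustrates routing only, not MoEBERT's importance-guided adaptation.
import numpy as np
rng = np.random.default_rng(0)

d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8
gate_w = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def moe_ffn(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model). Each token visits exactly one expert."""
    expert_idx = (x @ gate_w).argmax(axis=-1)      # top-1 routing
    out = np.empty_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = expert_idx == e
        if mask.any():
            h = np.maximum(x[mask] @ w1, 0.0)      # ReLU FFN expert
            out[mask] = h @ w2
    return out

x = rng.normal(size=(n_tokens, d_model))
print(moe_ffn(x).shape)  # (8, 16)
```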
- A Compression-Compilation Framework for On-mobile Real-time BERT Applications [36.54139770775837]
Transformer-based deep learning models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks.
We propose a compression-compilation co-design framework that guarantees that the identified model meets both resource and real-time specifications of mobile devices.
We present two types of BERT applications on mobile devices: Question Answering (QA) and Text Generation.
arXiv Detail & Related papers (2021-05-30T16:19:11Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- It's always personal: Using Early Exits for Efficient On-Device CNN Personalisation [19.046126301352274]
On-device machine learning is becoming a reality thanks to the availability of powerful hardware and model compression techniques.
In this work, we observe that a much smaller, personalised model can be employed to fit a specific scenario.
We introduce PersEPhonEE, a framework that attaches early exits on the model and personalises them on-device.
arXiv Detail & Related papers (2021-02-02T09:10:17Z)
- Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [12.761055946548437]
We propose an efficient transformer-based large-scale language representation using hardware-friendly block structured pruning.
Besides significantly reducing weight storage and computation, the proposed approach achieves high compression rates.
It is suitable to deploy the final compressed model on resource-constrained edge devices.
arXiv Detail & Related papers (2020-09-17T04:45:47Z)
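Block-structured pruning zeroes whole contiguous blocks of a weight matrix so the surviving computation stays regular enough for mobile hardware to exploit. A minimal magnitude-based sketch follows; the block size and L2 scoring rule are illustrative choices, not the paper's exact scheme.

```python
# Minimal block-structured magnitude pruning of a weight matrix.
# Block size and the L2 scoring rule are illustrative choices.
import numpy as np

def block_prune(w: np.ndarray, block=(4, 4), sparsity=0.5) -> np.ndarray:
    rows, cols = w.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    # View the matrix as a grid of (br x bc) blocks and score each
    # block by its L2 norm.
    blocks = w.reshape(rows // br, br, cols // bc, bc)
    scores = np.linalg.norm(blocks, axis=(1, 3))
    # Zero out the lowest-scoring fraction of blocks.
    k = int(sparsity * scores.size)
    threshold = np.sort(scores, axis=None)[k]
    mask = (scores >= threshold)[:, None, :, None]
    return (blocks * mask).reshape(rows, cols)

w = np.random.randn(16, 16).astype(np.float32)
pruned = block_prune(w, block=(4, 4), sparsity=0.5)
print("zeroed fraction:", (pruned == 0).mean())
```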
- Finding Fast Transformers: One-Shot Neural Architecture Search by Component Composition [11.6409723227448]
Transformer-based models have achieved state-of-the-art results in many tasks in natural language processing.
We develop an efficient algorithm to search for fast models while maintaining model quality.
arXiv Detail & Related papers (2020-08-15T23:12:25Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances, and a late (and accurate) exit for hard instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
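This entry, like PersEPhonEE above, relies on early exiting: intermediate classifiers let easy inputs leave the network before the final layer. A minimal confidence-threshold sketch, with stub layers and exit heads rather than either paper's calibrated exits:

```python
# Minimal confidence-based early-exit inference loop. The stub layers,
# exit heads, and fixed threshold are illustrative; real systems
# calibrate per-exit thresholds on held-out data.
import numpy as np
rng = np.random.default_rng(0)

N_LAYERS, D, N_CLASSES = 6, 8, 3
layers = [rng.normal(size=(D, D)) for _ in range(N_LAYERS)]
exit_heads = [rng.normal(size=(D, N_CLASSES)) for _ in range(N_LAYERS)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_predict(x: np.ndarray, threshold: float = 0.9):
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = np.tanh(h @ layer)
        probs = softmax(h @ head)
        # Leave as soon as an exit head is confident enough (the last
        # layer always exits).
        if probs.max() >= threshold or i == N_LAYERS - 1:
            return int(probs.argmax()), i + 1

label, layers_used = early_exit_predict(rng.normal(size=D))
print(f"class {label} after {layers_used}/{N_LAYERS} layers")
```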
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.