A Compression-Compilation Framework for On-mobile Real-time BERT Applications
- URL: http://arxiv.org/abs/2106.00526v1
- Date: Sun, 30 May 2021 16:19:11 GMT
- Title: A Compression-Compilation Framework for On-mobile Real-time BERT Applications
- Authors: Wei Niu, Zhenglun Kong, Geng Yuan, Weiwen Jiang, Jiexiong Guan, Caiwen Ding, Pu Zhao, Sijia Liu, Bin Ren, Yanzhi Wang
- Abstract summary: Transformer-based deep learning models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks.
We propose a compression-compilation co-design framework that guarantees that the identified model meets both the resource and real-time specifications of mobile devices.
We present two types of BERT applications on mobile devices: Question Answering (QA) and Text Generation.
- Score: 36.54139770775837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based deep learning models have increasingly demonstrated high
accuracy on many natural language processing (NLP) tasks. In this paper, we
propose a compression-compilation co-design framework that guarantees that the
identified model meets both the resource and real-time specifications of mobile
devices. Our framework applies a compiler-aware neural architecture
optimization method (CANAO), which generates an optimal compressed model that
balances accuracy and latency. We achieve up to a 7.8x speedup over
TensorFlow-Lite with only minor accuracy loss. We present two types of BERT
applications on mobile devices: Question Answering (QA) and Text Generation.
Both can be executed in real time with latency as low as 45 ms.
Videos demonstrating the framework can be found at
https://www.youtube.com/watch?v=_WIRvK_2PZI
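As a rough illustration of the co-design idea, the hypothetical Python sketch below keeps only candidate compressed models whose compiled on-device latency fits a real-time budget and picks the most accurate survivor. All names (Candidate, select_model, the example numbers) are illustrative assumptions, not the authors' CANAO implementation, in which the latency feedback comes from the compiler-aware optimization itself rather than a fixed table.

```python
# Illustrative sketch of compiler-aware, latency-constrained model selection.
# All names and numbers here are hypothetical; CANAO itself is described in
# the paper and is not reproduced by this snippet.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str            # e.g., "bert-6l-compact"
    accuracy: float      # task accuracy from a quick evaluation (0..1)
    latency_ms: float    # latency measured through the compiler/runtime on device

def select_model(candidates: List[Candidate],
                 latency_budget_ms: float = 45.0) -> Optional[Candidate]:
    """Pick the most accurate candidate that meets the real-time budget."""
    feasible = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if not feasible:
        return None  # no architecture satisfies the mobile real-time constraint
    return max(feasible, key=lambda c: c.accuracy)

# Example pool; the 45 ms budget mirrors the latency reported in the abstract.
pool = [
    Candidate("bert-base",       accuracy=0.88, latency_ms=120.0),
    Candidate("bert-6l-compact", accuracy=0.86, latency_ms=44.0),
    Candidate("bert-4l-tiny",    accuracy=0.82, latency_ms=30.0),
]
best = select_model(pool, latency_budget_ms=45.0)
print(best.name if best else "no feasible model")  # -> bert-6l-compact
```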
Related papers
- Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction in memory cost for the KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
- Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized Conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
- Seer: Language Instructed Video Prediction with Latent Diffusion Models [43.708550061909754]
Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis (a minimal sketch of this inflation idea appears after this list).
With its adaptable architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames.
arXiv Detail & Related papers (2023-03-27T03:12:24Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that improves inference efficiency and latency for a wide range of text generation applications (a minimal sketch of the underlying speculative decoding loop appears after this list).
On an NVIDIA T4 GPU, our framework achieves up to a 2.12x speedup with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications to the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search [100.71365025972258]
We propose NAS-BERT, an efficient method for BERT compression.
NAS-BERT trains a big supernet on a search space and outputs multiple compressed models with adaptive sizes and latencies.
Experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches.
arXiv Detail & Related papers (2021-05-30T07:20:27Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best BERT model structure for a given computation size to match specific devices.
Our framework guarantees that the identified model meets both the resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU, with 0.5-2% accuracy loss, compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design [38.98949683262209]
We propose the YOLObile framework, which achieves real-time object detection on mobile devices via compression-compilation co-design.
A novel block-punched pruning scheme is proposed that supports any kernel size (a minimal sketch of the pruning pattern appears after this list).
Under our YOLObile framework, we achieve a 17 FPS inference speed using the GPU on a Samsung Galaxy S20.
arXiv Detail & Related papers (2020-09-12T01:41:08Z)
- RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobile devices.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)
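As referenced in the Seer entry above, here is a minimal sketch of the generic 2D-to-3D kernel inflation trick: replicate a pretrained 2D convolution kernel along a new temporal axis and rescale so a temporally constant input reproduces the 2D output. This is the general idea only, not Seer's actual architecture; the function name and shapes are illustrative assumptions.

```python
# Generic 2D-to-3D kernel "inflation" along the temporal axis.
# Purely illustrative of the idea named in the Seer summary, not Seer's code.
import numpy as np

def inflate_kernel_2d_to_3d(w2d: np.ndarray, t: int = 3) -> np.ndarray:
    """Turn a (out, in, kh, kw) conv kernel into (out, in, t, kh, kw).

    The 2D weights are replicated t times along the new temporal axis and
    divided by t, so a constant-in-time input yields the original 2D output.
    """
    assert w2d.ndim == 4, "expected (out_channels, in_channels, kh, kw)"
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / float(t)

w2d = np.random.randn(64, 32, 3, 3).astype(np.float32)
w3d = inflate_kernel_2d_to_3d(w2d, t=3)
print(w3d.shape)  # (64, 32, 3, 3, 3)
```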
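As referenced in the Big Little Decoder entry above, a minimal sketch of a speculative decoding loop: a small model drafts a few tokens, the large model verifies them in one pass, and generation rolls back at the first disagreement. This is the generic idea only; BiLD's actual fallback and rollback policies differ, and `small`/`large_batch` are stand-in callables, not the paper's API.

```python
# Schematic speculative decoding: small model drafts, large model verifies.
from typing import Callable, List

def speculative_decode(small: Callable[[List[int]], int],
                       large_batch: Callable[[List[int], List[int]], List[int]],
                       prompt: List[int],
                       max_new: int = 32,
                       draft: int = 4) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) The cheap model drafts `draft` tokens autoregressively.
        drafted = []
        for _ in range(draft):
            drafted.append(small(seq + drafted))
        # 2) The large model scores the drafted positions in one pass and
        #    returns the token it would have produced at each position.
        verified = large_batch(seq, drafted)
        # 3) Accept the longest matching prefix; at the first mismatch, take
        #    the large model's token, then resume drafting from there.
        n = 0
        while n < draft and drafted[n] == verified[n]:
            n += 1
        seq += drafted[:n]
        if n < draft:
            seq.append(verified[n])
    return seq[:len(prompt) + max_new]

# Toy usage with dummy "models" that always agree, so all drafts are accepted.
small = lambda s: len(s) % 7
large_batch = lambda s, d: [(len(s) + i) % 7 for i in range(len(d))]
print(speculative_decode(small, large_batch, [1, 2, 3], max_new=8))
```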
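As referenced in the YOLObile entry above, a minimal sketch of a block-punched-style pruning mask: the weight matrix is tiled into blocks, and within each block the same positions are zeroed across all rows, which keeps the sparsity pattern hardware friendly. The block shape, sparsity level, and magnitude criterion here are assumptions for illustration, not YOLObile's exact scheme.

```python
# Schematic block-punched-style pruning: within each block of the flattened
# weight matrix, punch out the same column positions in every row, chosen by
# aggregate magnitude. Illustrative only.
import numpy as np

def block_punched_prune(w: np.ndarray, block_rows: int = 4,
                        block_cols: int = 8, sparsity: float = 0.5) -> np.ndarray:
    rows, cols = w.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    out = w.copy()
    k = int(block_cols * sparsity)  # columns punched out per block
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            block = out[r:r + block_rows, c:c + block_cols]
            # Rank block columns by total magnitude; zero the weakest ones
            # at identical positions in every row of the block.
            col_score = np.abs(block).sum(axis=0)
            punched = np.argsort(col_score)[:k]
            block[:, punched] = 0.0
    return out

w = np.random.randn(8, 16).astype(np.float32)
pruned = block_punched_prune(w)
print((pruned == 0).mean())  # ~0.5 sparsity, in a regular, compiler-friendly pattern
```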