Related papers: Fast DistilBERT on CPUs

URL: http://arxiv.org/abs/2211.07715v1
Date: Thu, 27 Oct 2022 07:22:50 GMT
Title: Fast DistilBERT on CPUs
Authors: Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, and Moshe Wasserblat
Abstract summary: Transformer-based language models have become the standard approach to solving natural language processing tasks. Industry adoption usually requires the maximum throughput to comply with certain latency constraints. We propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators.
Score: 13.29188219884869
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires the maximum throughput to comply with certain latency constraints, which prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning may be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model showing minimal accuracy loss on the question-answering SQuADv1.1 benchmark, and throughput results under typical production constraints and environments. Our results outperform the existing state-of-the-art runtime, Neural Magic's DeepSparse, by up to 50% and achieve up to a 4.1x speedup over ONNX Runtime.

Related papers

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs [12.883586189626431]
Transformer-based language models have become the standard approach for natural language processing tasks. Most existing neural network inference runtimes lack adequate support for structured sparsity. We propose an efficient sparse deep learning inference software stack for Transformer-based language models.
arXiv Detail & Related papers (2023-06-28T23:55:51Z)
Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to large latency and computational costs. We propose a model acceleration approach for large language models. Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
arXiv Detail & Related papers (2023-05-21T13:30:56Z)
TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models. The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions. We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models. Prior work on model pruning requires retraining the model. We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [12.761055946548437]
We propose an efficient transformer-based large-scale language representation using hardware-friendly block structure pruning. Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates. It is suitable to deploy the final compressed model on resource-constrained edge devices.
arXiv Detail & Related papers (2020-09-17T04:45:47Z)
Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices. Our framework can guarantee that the identified model meets both the resource and real-time specifications of mobile devices. Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
Accelerating Natural Language Understanding in Task-Oriented Dialog [6.757982879080109]
We show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters. We also perform acceleration experiments on CPUs, where we observe our multi-task model predicts intents and slots nearly 63x faster than even DistilBERT.
arXiv Detail & Related papers (2020-06-05T21:36:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.