Fast DistilBERT on CPUs
- URL: http://arxiv.org/abs/2211.07715v1
- Date: Thu, 27 Oct 2022 07:22:50 GMT
- Title: Fast DistilBERT on CPUs
- Authors: Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi
Ding, Hanwen Chang, Guy Boudoukh, and Moshe Wasserblat
- Abstract summary: Transformer-based language models have become the standard approach to solving natural language processing tasks.
Industry adoption usually requires the maximum throughput to comply with certain latency constraints.
We propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators.
- Score: 13.29188219884869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based language models have become the standard approach to
solving natural language processing tasks. However, industry adoption usually
requires the maximum throughput to comply with certain latency constraints that
prevent Transformer models from being used in production. To address this gap,
model compression techniques such as quantization and pruning may be used to
improve inference efficiency. However, these compression techniques require
specialized software to apply and deploy at scale. In this work, we propose a
new pipeline for creating and running Fast Transformer models on CPUs,
utilizing hardware-aware pruning, knowledge distillation, quantization, and our
own Transformer inference runtime engine with optimized kernels for sparse and
quantized operators. We demonstrate the efficiency of our pipeline by creating
a Fast DistilBERT model showing minimal accuracy loss on the question-answering
SQuADv1.1 benchmark, and throughput results under typical production
constraints and environments. Our results outperform the existing state-of-the-art
Neural Magic DeepSparse runtime by up to 50% and achieve up to a 4.1x
speedup over ONNX Runtime.
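As a rough illustration of two of the compression steps named in the abstract (structured pruning and quantization), the NumPy sketch below zeroes out the lowest-magnitude blocks of a weight matrix and then applies symmetric per-tensor int8 quantization. The block shape, the sparsity ratio, and the helper names `block_prune` and `quantize_int8` are assumptions chosen for illustration. They are not taken from the paper, and this is not the authors' pipeline or inference engine.

```python
import numpy as np


def block_prune(weights, block=(4, 1), sparsity=0.75):
    """Zero out the lowest-magnitude blocks (hypothetical helper, not from the paper)."""
    rows, cols = weights.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "shape must be divisible by block"
    # View the matrix as a grid of (br x bc) blocks.
    blocks = weights.reshape(rows // br, br, cols // bc, bc)
    # Score each block by the sum of absolute values of its entries.
    scores = np.abs(blocks).sum(axis=(1, 3))
    n_prune = int(scores.size * sparsity)
    # Mark the n_prune lowest-scoring blocks for removal.
    order = np.argsort(scores, axis=None)
    mask = np.ones(scores.size, dtype=bool)
    mask[order[:n_prune]] = False
    mask = mask.reshape(scores.shape)
    # Broadcast the block-level mask back to element granularity.
    pruned = blocks * mask[:, None, :, None]
    return pruned.reshape(rows, cols)


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (hypothetical helper, not from the paper)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


# Assumed toy input: a random 8x8 float32 weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)).astype(np.float32)

w_sparse = block_prune(w, block=(4, 1), sparsity=0.75)
q, scale = quantize_int8(w_sparse)
w_restored = q.astype(np.float32) * scale  # dequantize to inspect the error
```

Pruning whole blocks rather than individual weights keeps the zeros in a regular pattern, which is the kind of structure a sparse kernel can skip efficiently. A real pipeline would also fine-tune the model, for example with knowledge distillation, to recover accuracy after pruning and quantization.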
Related papers
- DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams [63.27233749591346]
Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks.
Stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations.
We propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder-only model that can be applied over existing deep encoder architectures with minimal changes.
arXiv Detail & Related papers (2025-11-21T16:15:43Z)
- QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations [16.476647190730876]
QUARK is a quantization-enabled FPGA acceleration framework.
It targets all nonlinear operations within Transformer-based models.
It achieves high-performance approximation through a novel circuit-sharing design.
arXiv Detail & Related papers (2025-11-10T06:46:21Z)
- The Fast for the Curious: How to accelerate fault-tolerant quantum applications [101.46859364118622]
We evaluate strategies for reducing the run time of fault-tolerant quantum computations.
We discuss how the co-design of hardware, fault tolerance, and algorithmic subroutines can reduce run times.
arXiv Detail & Related papers (2025-10-30T02:27:55Z)
- Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization [0.4499833362998488]
The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs.
We propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations.
We introduce an early exit mechanism, which exploits the TM's AND-based clause evaluation to avoid unnecessary computations.
arXiv Detail & Related papers (2025-10-17T13:44:20Z)
- An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs [12.883586189626431]
Transformer-based language models have become the standard approach for natural language processing tasks.
Most existing neural network inference runtimes lack adequate support for structured sparsity.
We propose an efficient sparse deep learning inference software stack for Transformer-based language models.
arXiv Detail & Related papers (2023-06-28T23:55:51Z)
- Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to large latency and computational costs.
We propose a model acceleration approach for large language models.
Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
arXiv Detail & Related papers (2023-05-21T13:30:56Z)
- TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning [12.761055946548437]
We propose an efficient transformer-based large-scale language representation using hardware-friendly block structure pruning.
Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates.
It is suitable to deploy the final compressed model on resource-constrained edge devices.
arXiv Detail & Related papers (2020-09-17T04:45:47Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- Accelerating Natural Language Understanding in Task-Oriented Dialog [6.757982879080109]
We show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters.
We also perform acceleration experiments on CPUs, where we observe our multi-task model predicts intents and slots nearly 63x faster than even DistilBERT.
arXiv Detail & Related papers (2020-06-05T21:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.