Optimizing Inference Performance of Transformers on CPUs
- URL: http://arxiv.org/abs/2102.06621v1
- Date: Fri, 12 Feb 2021 17:01:35 GMT
- Title: Optimizing Inference Performance of Transformers on CPUs
- Authors: Dave Dice and Alex Kogan
- Abstract summary: Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question-answering.
This paper presents an empirical analysis of the scalability and performance of inferencing a Transformer-based model on CPUs.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The Transformer architecture revolutionized the field of natural language
processing (NLP). Transformer-based models (e.g., BERT) power many important
Web services, such as search, translation, and question-answering. While
enormous research attention is paid to the training of those models, relatively
little effort has been made to improve their inference performance. This paper
addresses this gap by presenting an empirical analysis of the scalability
and performance of inferencing a Transformer-based model on CPUs. Focusing on
the highly popular BERT model, we identify key components of the Transformer
architecture where the bulk of the computation happens, and propose three
optimizations to speed them up. The optimizations are evaluated using the
inference benchmark from HuggingFace, and are shown to achieve speedups of
up to 2.36x. The considered optimizations require no changes to the
implementation of the models and do not affect their accuracy.
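As a rough illustration of the kind of CPU scalability experiment the abstract describes (not the authors' benchmark harness), the sketch below times BERT-base inference under different intra-op thread counts using PyTorch and HuggingFace Transformers. The model name, batch shape, and iteration counts are illustrative assumptions.

```python
# A minimal sketch, assuming PyTorch + HuggingFace Transformers are installed:
# measure BERT CPU inference latency while varying intra-op thread counts.
import time

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# A small fixed batch; real benchmarks sweep batch size and sequence length.
inputs = tokenizer(["CPU inference scalability test."] * 8,
                   padding=True, return_tensors="pt")

for n_threads in (1, 2, 4, 8):
    torch.set_num_threads(n_threads)   # intra-op parallelism for the GEMMs
    with torch.no_grad():
        model(**inputs)                # warm-up run
        start = time.perf_counter()
        for _ in range(10):
            model(**inputs)
    elapsed = (time.perf_counter() - start) / 10
    print(f"{n_threads} threads: {elapsed * 1000:.1f} ms/batch")
```

Plotting latency against thread count in this way exposes where the model stops scaling, which is the starting point for the kind of analysis the paper performs.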
Related papers
- Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to feedforward networks (FFNs).
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6x FFN speed-up with 32% of the parameters) and effective during training.
Motivated by this finding, we develop wide and structured networks that surpass current medium-sized and large-sized Transformers in perplexity and throughput.
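As a minimal sketch of the low-rank FFN parametrization this summary describes (not the paper's implementation), each dense FFN projection can be factored into a product of two narrow linear maps. The dimensions and rank below are illustrative; the rank controls what fraction of the original parameters is kept.

```python
# A minimal sketch of low-rank FFN parametrization: each weight matrix
# W (d_in x d_out) is factored as U @ V with a small inner dimension.
import torch.nn as nn

class LowRankFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, rank=256):
        super().__init__()
        # Parameters drop from d_model*d_ff to roughly rank*(d_model + d_ff)
        # per projection; the rank trades capacity for speed and size.
        self.up = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                nn.Linear(rank, d_ff))
        self.act = nn.GELU()
        self.down = nn.Sequential(nn.Linear(d_ff, rank, bias=False),
                                  nn.Linear(rank, d_model))

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```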
arXiv Detail & Related papers (2024-07-13T10:08:55Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
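For readers unfamiliar with the term, magnitude-based pruning zeroes out the smallest-magnitude weights. A minimal sketch using PyTorch's built-in pruning utility follows; the layer size and 50% sparsity level are illustrative choices, not the paper's settings.

```python
# A minimal sketch of magnitude-based pruning: drop the smallest-|w|
# entries of a linear layer's weight matrix.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)
prune.l1_unstructured(layer, name="weight", amount=0.5)  # prune 50% by |w|
sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```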
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors [5.432613942292548]
Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements.
Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues.
This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process.
arXiv Detail & Related papers (2023-09-29T04:49:35Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
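To make the AT/NAT contrast concrete, here is a conceptual sketch (not the paper's method): an autoregressive decoder needs one forward pass per output token, while a non-autoregressive decoder fills every position in a single pass. The stub model returning random logits is a hypothetical stand-in for a real decoder.

```python
# A conceptual sketch of autoregressive vs. non-autoregressive decoding.
import torch

VOCAB = 100
# Hypothetical stand-in decoder: (batch, length) ids -> (batch, length, vocab) logits.
model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], VOCAB)

def autoregressive_decode(bos, length):
    tokens = [bos]
    for _ in range(length):                       # `length` sequential passes
        logits = model(torch.tensor([tokens]))
        tokens.append(int(logits[0, -1].argmax()))
    return tokens[1:]

def non_autoregressive_decode(placeholder, length):
    inputs = torch.full((1, length), placeholder)  # all positions at once
    logits = model(inputs)                         # a single forward pass
    return logits[0].argmax(dim=-1).tolist()
```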
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
The proposed adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture in which domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming the state-of-the-art methods by margins of 3%, 4% and 9%, respectively.
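As a rough sketch of the frequency-domain filtering idea (not the FAMLP architecture itself), token features can be moved to the frequency domain with an FFT, reweighted by a learned per-bin filter, and transformed back; the shapes below are illustrative.

```python
# A minimal sketch of learned frequency-domain filtering over features.
import torch
import torch.nn as nn

class FrequencyFilter(nn.Module):
    def __init__(self, seq_len=196, dim=384):
        super().__init__()
        # One learnable weight per (frequency bin, channel).
        self.filter = nn.Parameter(torch.ones(seq_len // 2 + 1, dim))

    def forward(self, x):                  # x: (batch, seq_len, dim)
        freq = torch.fft.rfft(x, dim=1)    # to the frequency domain
        freq = freq * self.filter          # suppress or keep frequency bins
        return torch.fft.irfft(freq, n=x.shape[1], dim=1)
```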
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
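A minimal sketch of the attention-plus-convolution pairing described above (not the GroupBERT implementation): self-attention handles global token interactions while a depthwise convolution handles local ones. The width, head count, and kernel size are illustrative.

```python
# A minimal sketch of pairing self-attention (global) with a
# depthwise convolution (local) inside one layer.
import torch.nn as nn

class ConvAugmentedLayer(nn.Module):
    def __init__(self, dim=768, heads=12, kernel=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2,
                              groups=dim)   # depthwise: local mixing only
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                   # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)           # global token interactions
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local interactions
        return self.norm2(x + c)
```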
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Transition-based Parsing with Stack-Transformers [32.029528327212795]
Recurrent Neural Networks considerably improved the performance of transition-based systems by modelling the global state.
We show that modifications of the cross-attention mechanism of the Transformer considerably strengthen performance on both dependency parsing and Abstract Meaning Representation (AMR) parsing.
arXiv Detail & Related papers (2020-10-20T23:20:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.