HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing
- URL: http://arxiv.org/abs/2005.14187v1
- Date: Thu, 28 May 2020 17:58:56 GMT
- Title: HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing
- Authors: Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang
Gan, Song Han
- Abstract summary: Hardware-Aware Transformers (HAT) are designed to enable low-latency inference on resource-constrained hardware platforms.
We train a $\textit{SuperTransformer}$ that covers all candidates in the design space, and efficiently produces many $\textit{SubTransformers}$ with weight sharing.
Experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware.
- Score: 78.48577649266018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but
they are difficult to deploy on hardware due to their intensive computation.
To enable low-latency inference on resource-constrained hardware platforms, we
propose to design Hardware-Aware Transformers (HAT) with neural architecture
search. We first construct a large design space with $\textit{arbitrary
encoder-decoder attention}$ and $\textit{heterogeneous layers}$. Then we train
a $\textit{SuperTransformer}$ that covers all candidates in the design space,
and efficiently produces many $\textit{SubTransformers}$ with weight sharing.
Finally, we perform an evolutionary search with a hardware latency constraint
to find a specialized $\textit{SubTransformer}$ dedicated to run fast on the
target hardware. Extensive experiments on four machine translation tasks
demonstrate that HAT can discover efficient models for different hardware (CPU,
GPU, IoT device). When running WMT'14 translation task on Raspberry Pi-4, HAT
can achieve $\textbf{3}\times$ speedup, $\textbf{3.7}\times$ smaller size over
baseline Transformer; $\textbf{2.7}\times$ speedup, $\textbf{3.6}\times$
smaller size over Evolved Transformer with $\textbf{12,041}\times$ less search
cost and no performance loss. HAT code is
https://github.com/mit-han-lab/hardware-aware-transformers.git
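The three-stage procedure above (SuperTransformer training with weight sharing, SubTransformer sampling, latency-constrained evolutionary search) can be pictured with the rough Python sketch below; the design-space dictionary, fitness function, and latency predictor are hypothetical placeholders, not the authors' implementation.

    import random

    def sample_config(space):
        """Draw one SubTransformer configuration from the design space."""
        return {name: random.choice(choices) for name, choices in space.items()}

    def evolutionary_search(space, fitness, latency, latency_limit,
                            iterations=30, population=125, parents=25, mutate_prob=0.3):
        """Latency-constrained evolutionary search (hyperparameters illustrative)."""
        # Seed the population with configurations that meet the latency budget.
        pop = []
        while len(pop) < population:
            cfg = sample_config(space)
            if latency(cfg) <= latency_limit:
                pop.append(cfg)
        for _ in range(iterations):
            # Rank SubTransformers by fitness, e.g. validation quality with weights
            # inherited from the SuperTransformer (no training from scratch).
            pop.sort(key=fitness, reverse=True)
            elite = pop[:parents]
            children = []
            while len(children) < population - parents:
                parent = random.choice(elite)
                # Mutation: resample each design dimension with some probability.
                child = {k: random.choice(space[k]) if random.random() < mutate_prob else v
                         for k, v in parent.items()}
                if latency(child) <= latency_limit:   # keep only fast-enough children
                    children.append(child)
            pop = elite + children
        return max(pop, key=fitness)

In HAT, the fitness of a candidate is evaluated with weights inherited from the SuperTransformer, and the latency is estimated with a predictor trained on measurements from the target hardware (CPU, GPU, or IoT device), which keeps the search cost low.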
Related papers
- Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
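Add-$\lambda$ smoothing itself is simple: $P(w \mid h) = (\mathrm{count}(h, w) + \lambda) / (\mathrm{count}(h) + \lambda |V|)$. The minimal sketch below illustrates that estimate; the function names and toy corpus are illustrative, not from the paper.

    from collections import Counter

    def add_lambda_ngram(tokens, n, lam, vocab_size):
        """Return P(w | history) with add-lambda smoothing."""
        ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        history = Counter(ng[:-1] for ng in ngrams.elements())

        def prob(w, hist):
            return (ngrams[tuple(hist) + (w,)] + lam) / (history[tuple(hist)] + lam * vocab_size)

        return prob

    # Example: bigram model with lambda = 0.1 over a toy corpus.
    p = add_lambda_ngram("a b a b b a".split(), n=2, lam=0.1, vocab_size=2)
    print(p("b", ["a"]))  # smoothed P(b | a)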
arXiv Detail & Related papers (2024-10-03T21:21:02Z)
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly due to the statistics calculation during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training.
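The progressive replacement can be pictured as a weighted mix of the two normalizations whose weight is annealed over training; the module below is a rough sketch of that idea, with a schedule and module name that are guesses rather than the paper's exact PRepBN.

    import torch
    import torch.nn as nn

    class ProgressiveNorm(nn.Module):
        """Blend LayerNorm and BatchNorm: y = (1 - gamma) * LN(x) + gamma * BN(x).
        gamma is annealed from 0 to 1 during training, after which only the
        BatchNorm branch (re-parameterizable for inference) remains."""
        def __init__(self, dim):
            super().__init__()
            self.ln = nn.LayerNorm(dim)
            self.bn = nn.BatchNorm1d(dim)
            self.register_buffer("gamma", torch.zeros(()))

        def set_progress(self, step, total_steps):
            # Linear schedule; the paper may use a different annealing scheme.
            self.gamma.fill_(min(1.0, step / total_steps))

        def forward(self, x):  # x: (batch, seq, dim)
            bn_out = self.bn(x.transpose(1, 2)).transpose(1, 2)
            return (1 - self.gamma) * self.ln(x) + self.gamma * bn_out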
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
- Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer [5.141764719319689]
Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks.
However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization.
We propose Trio-ViT, which eliminates the troublesome Softmax and integrates linear attention with low computational complexity.
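Softmax-free linear attention replaces exp-normalized scores with a positive feature map so attention can be computed as $\phi(Q)(\phi(K)^\top V)$ in time linear in sequence length; the snippet below is a generic sketch of that idea, not Trio-ViT's exact formulation.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        """Generic softmax-free linear attention.
        q, k, v: (batch, heads, seq, dim). Cost is O(seq * dim^2) instead of O(seq^2 * dim)."""
        q = F.elu(q) + 1          # simple positive feature map (one common choice)
        k = F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)                       # sum_n phi(k_n) v_n^T
        z = 1 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)   # per-query normalizer
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)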
arXiv Detail & Related papers (2024-05-06T21:57:35Z)
- Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers [8.908747084128397]
We introduce the temporal counting logic $\textbf{K}_\text{t}[\#]$ alongside the RASP variant $\textbf{C-RASP}$.
We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers.
arXiv Detail & Related papers (2024-04-05T20:36:30Z)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
- TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices [7.529632803434906]
TinyFormer is a framework designed to develop and deploy resource-efficient transformers on MCUs.
TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine.
TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory.
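As a back-of-the-envelope reading of those constraints, a candidate architecture is only deployable if its quantized weights fit in the $1$MB flash budget and its peak activations fit in the $320$KB SRAM budget. The helper below is a simplified illustration of such a feasibility filter, not part of TinyFormer.

    def fits_mcu(param_count, activation_bytes, weight_bits=8,
                 storage_limit=1 * 1024 * 1024, sram_limit=320 * 1024):
        """Rough MCU feasibility check: quantized weights must fit in flash
        and peak activations in SRAM (accounting here is deliberately simplified)."""
        storage = param_count * weight_bits // 8
        return storage <= storage_limit and activation_bytes <= sram_limit

    # Example: a 900k-parameter 8-bit model with 200 KB of peak activations.
    print(fits_mcu(param_count=900_000, activation_bytes=200 * 1024))  # True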
arXiv Detail & Related papers (2023-11-03T07:34:47Z)
- Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$ [118.04625413322827]
$\texttt{t5x}$ and $\texttt{seqio}$ are open source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
arXiv Detail & Related papers (2022-03-31T17:12:13Z)
- Memory-Efficient Differentiable Transformer Architecture Search [59.47253706925725]
We propose a multi-split reversible network and combine it with DARTS.
Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs.
We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech.
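The memory saving comes from reversibility: because each split's input can be recomputed from its output, intermediate activations need not be stored during search. A minimal two-split reversible coupling (in the style of RevNet, not the paper's exact multi-split design) looks like the following.

    import torch

    def reversible_forward(x1, x2, f, g):
        """Forward through one reversible block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
        y1 = x1 + f(x2)
        y2 = x2 + g(y1)
        return y1, y2

    def reversible_inverse(y1, y2, f, g):
        """Recover the inputs from the outputs, so activations need not be cached."""
        x2 = y2 - g(y1)
        x1 = y1 - f(x2)
        return x1, x2

    # Sanity check with arbitrary sub-functions f and g.
    f = torch.nn.Linear(8, 8)
    g = torch.nn.Linear(8, 8)
    x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
    y1, y2 = reversible_forward(x1, x2, f, g)
    r1, r2 = reversible_inverse(y1, y2, f, g)
    assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)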
arXiv Detail & Related papers (2021-05-31T01:52:36Z)
- Shallow-to-Deep Training for Neural Machine Translation [42.62107851930165]
In this paper, we investigate the behavior of a well-tuned deep Transformer system.
We find that stacking layers is helpful in improving the representation ability of NMT models.
This inspires us to develop a shallow-to-deep training method that learns deep models by stacking shallow models.
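One simple way to realize shallow-to-deep training is to grow a trained encoder by copying its top layers and then continuing training; the sketch below illustrates such progressive stacking, with a copy scheme that is illustrative rather than necessarily the paper's.

    import copy
    import torch.nn as nn

    def grow_encoder(layers: nn.ModuleList, growth: int) -> nn.ModuleList:
        """Deepen a trained stack by appending copies of its top layers,
        then keep training the deeper model (progressive stacking)."""
        new_layers = [copy.deepcopy(layer) for layer in layers[-growth:]]
        return nn.ModuleList(list(layers) + new_layers)

    # Example: a 6-layer encoder grown to 12 layers in two steps of 3.
    encoder = nn.ModuleList(nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(6))
    encoder = grow_encoder(encoder, growth=3)   # 9 layers; train for a while, then...
    encoder = grow_encoder(encoder, growth=3)   # 12 layers
    print(len(encoder))  # 12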
arXiv Detail & Related papers (2020-10-08T02:36:07Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
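The compression amounts to pooling the hidden-state sequence between blocks (for example, stride-2 pooling of the length dimension); the sketch below is a generic illustration of sequence-length pooling rather than Funnel-Transformer's exact query-only pooling.

    import torch
    import torch.nn.functional as F

    def compress_hidden_states(h, stride=2):
        """Shorten a hidden-state sequence by average pooling along the length axis.
        h: (batch, seq_len, dim) -> (batch, ceil(seq_len / stride), dim)."""
        return F.avg_pool1d(h.transpose(1, 2), kernel_size=stride,
                            stride=stride, ceil_mode=True).transpose(1, 2)

    h = torch.randn(2, 128, 768)
    print(compress_hidden_states(h).shape)  # torch.Size([2, 64, 768])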
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.