HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing
- URL: http://arxiv.org/abs/2005.14187v1
- Date: Thu, 28 May 2020 17:58:56 GMT
- Title: HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing
- Authors: Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang
Gan, Song Han
- Abstract summary: Hardware-Aware Transformers (HAT) are designed to enable low-latency inference on resource-constrained hardware platforms.
We train a $\textit{SuperTransformer}$ that covers all candidates in the design space, and efficiently produces many $\textit{SubTransformers}$ with weight sharing.
Experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware.
- Score: 78.48577649266018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but
they are difficult to deploy on hardware due to their intensive computation.
To enable low-latency inference on resource-constrained hardware platforms, we
propose to design Hardware-Aware Transformers (HAT) with neural architecture
search. We first construct a large design space with $\textit{arbitrary
encoder-decoder attention}$ and $\textit{heterogeneous layers}$. Then we train
a $\textit{SuperTransformer}$ that covers all candidates in the design space,
and efficiently produces many $\textit{SubTransformers}$ with weight sharing.
Finally, we perform an evolutionary search with a hardware latency constraint
to find a specialized $\textit{SubTransformer}$ dedicated to run fast on the
target hardware. Extensive experiments on four machine translation tasks
demonstrate that HAT can discover efficient models for different hardware (CPU,
GPU, IoT device). When running WMT'14 translation task on Raspberry Pi-4, HAT
can achieve $\textbf{3}\times$ speedup, $\textbf{3.7}\times$ smaller size over
baseline Transformer; $\textbf{2.7}\times$ speedup, $\textbf{3.6}\times$
smaller size over Evolved Transformer with $\textbf{12,041}\times$ less search
cost and no performance loss. HAT code is
https://github.com/mit-han-lab/hardware-aware-transformers.git
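The three-stage procedure above (SuperTransformer training with weight sharing, SubTransformer sampling, latency-constrained evolutionary search) can be pictured with the rough Python sketch below; the design-space dictionary, fitness function, and latency predictor are hypothetical placeholders, not the authors' implementation.

    import random

    def sample_config(space):
        """Draw one SubTransformer configuration from the design space."""
        return {name: random.choice(choices) for name, choices in space.items()}

    def evolutionary_search(space, fitness, latency, latency_limit,
                            iterations=30, population=125, parents=25, mutate_prob=0.3):
        """Latency-constrained evolutionary search (hyperparameters illustrative)."""
        # Seed the population with configurations that meet the latency budget.
        pop = []
        while len(pop) < population:
            cfg = sample_config(space)
            if latency(cfg) <= latency_limit:
                pop.append(cfg)
        for _ in range(iterations):
            # Rank SubTransformers by fitness, e.g. validation quality with weights
            # inherited from the SuperTransformer (no training from scratch).
            pop.sort(key=fitness, reverse=True)
            elite = pop[:parents]
            children = []
            while len(children) < population - parents:
                parent = random.choice(elite)
                # Mutation: resample each design dimension with some probability.
                child = {k: random.choice(space[k]) if random.random() < mutate_prob else v
                         for k, v in parent.items()}
                if latency(child) <= latency_limit:   # keep only fast-enough children
                    children.append(child)
            pop = elite + children
        return max(pop, key=fitness)

In HAT, the fitness of a candidate is evaluated with weights inherited from the SuperTransformer, and the latency is estimated with a predictor trained on measurements from the target hardware (CPU, GPU, or IoT device), which keeps the search cost low.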
Related papers
- Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
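Add-$\lambda$ smoothing itself is simple: $P(w \mid h) = (\mathrm{count}(h, w) + \lambda) / (\mathrm{count}(h) + \lambda |V|)$. The minimal sketch below illustrates that estimate; the function names and toy corpus are illustrative, not from the paper.

    from collections import Counter

    def add_lambda_ngram(tokens, n, lam, vocab_size):
        """Return P(w | history) with add-lambda smoothing."""
        ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        history = Counter(ng[:-1] for ng in ngrams.elements())

        def prob(w, hist):
            return (ngrams[tuple(hist) + (w,)] + lam) / (history[tuple(hist)] + lam * vocab_size)

        return prob

    # Example: bigram model with lambda = 0.1 over a toy corpus.
    p = add_lambda_ngram("a b a b b a".split(), n=2, lam=0.1, vocab_size=2)
    print(p("b", ["a"]))  # smoothed P(b | a)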
arXiv Detail & Related papers (2024-10-03T21:21:02Z)
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly due to the statistics calculation during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training.
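The progressive replacement can be pictured as a weighted mix of the two normalizations whose weight is annealed over training; the module below is a rough sketch of that idea, with a schedule and module name that are guesses rather than the paper's exact PRepBN.

    import torch
    import torch.nn as nn

    class ProgressiveNorm(nn.Module):
        """Blend LayerNorm and BatchNorm: y = (1 - gamma) * LN(x) + gamma * BN(x).
        gamma is annealed from 0 to 1 during training, after which only the
        BatchNorm branch (re-parameterizable for inference) remains."""
        def __init__(self, dim):
            super().__init__()
            self.ln = nn.LayerNorm(dim)
            self.bn = nn.BatchNorm1d(dim)
            self.register_buffer("gamma", torch.zeros(()))

        def set_progress(self, step, total_steps):
            # Linear schedule; the paper may use a different annealing scheme.
            self.gamma.fill_(min(1.0, step / total_steps))

        def forward(self, x):  # x: (batch, seq, dim)
            bn_out = self.bn(x.transpose(1, 2)).transpose(1, 2)
            return (1 - self.gamma) * self.ln(x) + self.gamma * bn_out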
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
- Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer [5.141764719319689]
Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks.
However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization.
We propose Trio-ViT, which eliminates the troublesome Softmax and integrates linear attention with low computational complexity.
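Softmax-free linear attention replaces exp-normalized scores with a positive feature map so attention can be computed as $\phi(Q)(\phi(K)^\top V)$ in time linear in sequence length; the snippet below is a generic sketch of that idea, not Trio-ViT's exact formulation.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        """Generic softmax-free linear attention.
        q, k, v: (batch, heads, seq, dim). Cost is O(seq * dim^2) instead of O(seq^2 * dim)."""
        q = F.elu(q) + 1          # simple positive feature map (one common choice)
        k = F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)                       # sum_n phi(k_n) v_n^T
        z = 1 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)   # per-query normalizer
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)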
arXiv Detail & Related papers (2024-05-06T21:57:35Z)
- Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers [8.908747084128397]
We introduce the temporal counting logic $\textbf{K}_\text{t}[\#]$ alongside the RASP variant $\textbf{C-RASP}$.
We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers.
arXiv Detail & Related papers (2024-04-05T20:36:30Z)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
- TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices [7.529632803434906]
TinyFormer is a framework designed to develop and deploy resource-efficient transformers on MCUs.
TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine.
TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory.
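As a back-of-the-envelope reading of those constraints, a candidate architecture is only deployable if its quantized weights fit in the $1$MB flash budget and its peak activations fit in the $320$KB SRAM budget. The helper below is a simplified illustration of such a feasibility filter, not part of TinyFormer.

    def fits_mcu(param_count, activation_bytes, weight_bits=8,
                 storage_limit=1 * 1024 * 1024, sram_limit=320 * 1024):
        """Rough MCU feasibility check: quantized weights must fit in flash
        and peak activations in SRAM (accounting here is deliberately simplified)."""
        storage = param_count * weight_bits // 8
        return storage <= storage_limit and activation_bytes <= sram_limit

    # Example: a 900k-parameter 8-bit model with 200 KB of peak activations.
    print(fits_mcu(param_count=900_000, activation_bytes=200 * 1024))  # True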
arXiv Detail & Related papers (2023-11-03T07:34:47Z)
- Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$ [118.04625413322827]
$\texttt{t5x}$ and $\texttt{seqio}$ are open source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
arXiv Detail & Related papers (2022-03-31T17:12:13Z)
- Memory-Efficient Differentiable Transformer Architecture Search [59.47253706925725]
We propose a multi-split reversible network and combine it with DARTS.
Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs.
We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech.
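The memory saving comes from reversibility: because each split's input can be recomputed from its output, intermediate activations need not be stored during search. A minimal two-split reversible coupling (in the style of RevNet, not the paper's exact multi-split design) looks like the following.

    import torch

    def reversible_forward(x1, x2, f, g):
        """Forward through one reversible block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
        y1 = x1 + f(x2)
        y2 = x2 + g(y1)
        return y1, y2

    def reversible_inverse(y1, y2, f, g):
        """Recover the inputs from the outputs, so activations need not be cached."""
        x2 = y2 - g(y1)
        x1 = y1 - f(x2)
        return x1, x2

    # Sanity check with arbitrary sub-functions f and g.
    f = torch.nn.Linear(8, 8)
    g = torch.nn.Linear(8, 8)
    x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
    y1, y2 = reversible_forward(x1, x2, f, g)
    r1, r2 = reversible_inverse(y1, y2, f, g)
    assert torch.allclose(r1, x1, atol=1e-5) and torch.allclose(r2, x2, atol=1e-5)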
arXiv Detail & Related papers (2021-05-31T01:52:36Z)
- Shallow-to-Deep Training for Neural Machine Translation [42.62107851930165]
In this paper, we investigate the behavior of a well-tuned deep Transformer system.
We find that stacking layers is helpful in improving the representation ability of NMT models.
This inspires us to develop a shallow-to-deep training method that learns deep models by stacking shallow models.
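One simple way to realize shallow-to-deep training is to grow a trained encoder by copying its top layers and then continuing training; the sketch below illustrates such progressive stacking, with a copy scheme that is illustrative rather than necessarily the paper's.

    import copy
    import torch.nn as nn

    def grow_encoder(layers: nn.ModuleList, growth: int) -> nn.ModuleList:
        """Deepen a trained stack by appending copies of its top layers,
        then keep training the deeper model (progressive stacking)."""
        new_layers = [copy.deepcopy(layer) for layer in layers[-growth:]]
        return nn.ModuleList(list(layers) + new_layers)

    # Example: a 6-layer encoder grown to 12 layers in two steps of 3.
    encoder = nn.ModuleList(nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(6))
    encoder = grow_encoder(encoder, growth=3)   # 9 layers; train for a while, then...
    encoder = grow_encoder(encoder, growth=3)   # 12 layers
    print(len(encoder))  # 12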
arXiv Detail & Related papers (2020-10-08T02:36:07Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
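The compression amounts to pooling the hidden-state sequence between blocks (for example, stride-2 pooling of the length dimension); the sketch below is a generic illustration of sequence-length pooling rather than Funnel-Transformer's exact query-only pooling.

    import torch
    import torch.nn.functional as F

    def compress_hidden_states(h, stride=2):
        """Shorten a hidden-state sequence by average pooling along the length axis.
        h: (batch, seq_len, dim) -> (batch, ceil(seq_len / stride), dim)."""
        return F.avg_pool1d(h.transpose(1, 2), kernel_size=stride,
                            stride=stride, ceil_mode=True).transpose(1, 2)

    h = torch.randn(2, 128, 768)
    print(compress_hidden_states(h).shape)  # torch.Size([2, 64, 768])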
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.