X-Former: In-Memory Acceleration of Transformers
- URL: http://arxiv.org/abs/2303.07470v1
- Date: Mon, 13 Mar 2023 21:11:54 GMT
- Title: X-Former: In-Memory Acceleration of Transformers
- Authors: Shrihari Sridharan, Jacob R. Stevens, Kaushik Roy and Anand
Raghunathan
- Abstract summary: Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism.
Traditional deep neural network (DNN) accelerators face limitations in processing Transformers efficiently.
In-memory accelerators based on non-volatile memory promise to be an effective solution to this challenge.
We present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements.
- Score: 7.194491150684456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have achieved great success in a wide variety of natural
language processing (NLP) tasks due to the attention mechanism, which assigns
an importance score for every word relative to other words in a sequence.
However, these models are very large, often reaching hundreds of billions of
parameters, and therefore require a large number of DRAM accesses. Hence,
traditional deep neural network (DNN) accelerators such as GPUs and TPUs face
limitations in processing Transformers efficiently. In-memory accelerators
based on non-volatile memory promise to be an effective solution to this
challenge, since they provide high storage density while performing massively
parallel matrix vector multiplications within memory arrays. However, attention
score computations, which are frequently used in Transformers (unlike CNNs and
RNNs), require matrix vector multiplications (MVM) where both operands change
dynamically for each input. As a result, conventional NVM-based accelerators
incur high write latency and write energy when used for Transformers, and
further suffer from the low endurance of most NVM technologies. To address
these challenges, we present X-Former, a hybrid in-memory hardware accelerator
that consists of both NVM and CMOS processing elements to execute transformer
workloads efficiently. To improve the hardware utilization of X-Former, we also
propose a sequence blocking dataflow, which overlaps the computations of the
two processing elements and reduces execution time. Across several benchmarks,
we show that X-Former achieves up to 85x and 7.5x improvements in latency and
energy over an NVIDIA GeForce GTX 1060 GPU and up to 10.7x and 4.6x improvements
in latency and energy over a state-of-the-art in-memory NVM accelerator.
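To make the distinction in the abstract concrete, the NumPy sketch below contrasts the two kinds of matrix products in a single self-attention layer. It is an illustration only, not code from the paper; the dimensions and variable names are assumed for the example. The projection weights are fixed after training and could be programmed once into NVM crossbars, whereas both operands of the attention-score products change with every input sequence.

```python
# Illustrative sketch (not from the paper): static-weight MVMs vs.
# dynamic-operand MVMs in one self-attention layer.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 128, 64                      # assumed sizes for the example
x = np.random.randn(seq_len, d_model)           # activations: new for every input

# Static-weight MVMs: W_q, W_k, W_v are fixed after training, so they can be
# written once into NVM crossbars and reused across all inputs.
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Dynamic-operand MVMs: both operands (Q and K, then the score matrix and V)
# change with every input sequence. Mapping them onto NVM crossbars would mean
# rewriting the arrays per input, incurring the write latency, write energy,
# and endurance costs that motivate executing these products on CMOS elements.
scores = softmax((Q @ K.T) / np.sqrt(d_model))  # (seq_len, seq_len)
attn_out = scores @ V                           # (seq_len, d_model)
```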
Related papers
- FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV).
This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs).
It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z)
- ProTEA: Programmable Transformer Encoder Acceleration on FPGA [0.0]
Transformer neural networks (TNNs) have been widely utilized across a diverse range of applications, including natural language processing (NLP), machine translation, and computer vision (CV).
Despite the popularity of TNNs, there have been only a limited number of hardware accelerators targeting these two critical blocks (multi-head attention and the feedforward network).
This paper introduces ProTEA, a programmable runtime accelerator tailored for the dense computations of state-of-the-art transformer encoders.
arXiv Detail & Related papers (2024-09-21T01:44:13Z)
- Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures [5.46396577345121]
The complexity of transformer models in artificial intelligence increases their computational costs, memory usage, and energy consumption.
We propose a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, which effectively minimizes off-chip data access.
Our approach achieves up to a 2.8x speedup when executing inference with state-of-the-art transformers.
arXiv Detail & Related papers (2023-12-20T13:01:25Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers; a minimal sketch of the underlying blockwise attention computation appears after this list.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Efficiency-driven Hardware Optimization for Adversarially Robust Neural Networks [3.125321230840342]
We will focus on how to address adversarial robustness for Deep Neural Networks (DNNs) through efficiency-driven hardware optimizations.
One such approach is approximate digital CMOS memories with hybrid 6T-8T cells that enable supply scaling (Vdd) yielding low-power operation.
Another memory optimization approach involves the creation of memristive crossbars that perform Matrix-Vector Multiplications (MVMs) efficiently with low energy and area requirements.
arXiv Detail & Related papers (2021-05-09T19:26:25Z)
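The Ring Attention and Blockwise Parallel Transformer entries above both rest on computing self-attention block by block so that the full attention-score matrix is never materialized. The single-device NumPy sketch below illustrates that core idea under stated assumptions and is not code from either paper; the block size, helper names, and running-max softmax bookkeeping are assumptions, and the actual methods additionally distribute blocks across devices (Ring Attention) and fuse the feedforward computation (BPT).

```python
# Illustrative blockwise self-attention with a streaming (running-max) softmax.
import numpy as np

def blockwise_attention(Q, K, V, block=32):
    """Compute softmax(Q K^T / sqrt(d)) V one query/key block at a time,
    keeping only running max/sum statistics instead of the full score matrix."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, n, block):
        q = Q[qs:qs + block]                      # (bq, d) query block
        m = np.full(q.shape[0], -np.inf)          # running max per query row
        l = np.zeros(q.shape[0])                  # running softmax denominator
        acc = np.zeros_like(q)                    # running weighted sum of V
        for ks in range(0, n, block):
            s = (q @ K[ks:ks + block].T) * scale  # (bq, bk) partial scores
            m_new = np.maximum(m, s.max(axis=1))
            alpha = np.exp(m - m_new)             # rescale old statistics
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=1)
            acc = acc * alpha[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

# Sanity check against the naive full-matrix computation.
n, d = 128, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref)
```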