TransCODE: Co-design of Transformers and Accelerators for Efficient
Training and Inference
- URL: http://arxiv.org/abs/2303.14882v1
- Date: Mon, 27 Mar 2023 02:45:18 GMT
- Title: TransCODE: Co-design of Transformers and Accelerators for Efficient
Training and Inference
- Authors: Shikhar Tuli and Niraj K. Jha
- Abstract summary: We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
- Score: 6.0093441900032465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated co-design of machine learning models and evaluation hardware is
critical for efficiently deploying such models at scale. Despite the
state-of-the-art performance of transformer models, they are not yet ready for
execution on resource-constrained hardware platforms. High memory requirements
and low parallelizability of the transformer architecture exacerbate this
problem. Recently-proposed accelerators attempt to optimize the throughput and
energy consumption of transformer models. However, such works are either
limited to a one-sided search of the model architecture or a restricted set of
off-the-shelf devices. Furthermore, previous works accelerate only model
inference and not training, which requires substantially more memory and
compute, making the problem even more challenging. To address these
limitations, this work proposes a dynamic training framework, called DynaProp,
that speeds up the training process and reduces memory consumption. DynaProp is
a low-overhead pruning method that prunes activations and gradients at runtime.
To effectively execute this method on hardware for a diverse set of transformer
architectures, we propose ELECTOR, a framework that simulates transformer
inference and training on a design space of accelerators. We use this simulator
in conjunction with the proposed co-design technique, called TransCODE, to
obtain the best-performing models with high accuracy on the given task and
minimize latency, energy consumption, and chip area. The obtained
transformer-accelerator pair achieves 0.3% higher accuracy than the
state-of-the-art pair while incurring 5.2$\times$ lower latency and 3.0$\times$
lower energy consumption.
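The abstract describes DynaProp only at a high level. As a rough, hypothetical sketch of what low-overhead runtime pruning of activations and gradients can look like, the snippet below zeroes out small-magnitude entries in both the forward and backward passes of a feed-forward block; the threshold value, module names, and shapes are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of threshold-based runtime pruning of activations and
# gradients, in the spirit of the DynaProp description above. The threshold,
# module names, and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class ThresholdPrune(torch.autograd.Function):
    """Zero small-magnitude activations on the forward pass and
    small-magnitude gradients on the backward pass."""

    @staticmethod
    def forward(ctx, x, tau):
        ctx.tau = tau
        return x * (x.abs() >= tau)  # sparsify activations at runtime

    @staticmethod
    def backward(ctx, grad_out):
        # Sparsify the incoming gradient with the same magnitude threshold.
        return grad_out * (grad_out.abs() >= ctx.tau), None


class PrunedReLU(nn.Module):
    def __init__(self, tau=1e-3):  # tau is an assumed threshold
        super().__init__()
        self.tau = tau

    def forward(self, x):
        return ThresholdPrune.apply(torch.relu(x), self.tau)


# Usage: drop the pruned activation into a transformer feed-forward block.
ffn = nn.Sequential(nn.Linear(768, 3072), PrunedReLU(tau=1e-3), nn.Linear(3072, 768))
x = torch.randn(8, 128, 768, requires_grad=True)
ffn(x).sum().backward()  # both passes now operate on sparsified tensors
```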
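Likewise, the TransCODE/ELECTOR co-design flow is only summarized above. The sketch below illustrates one way a transformer-accelerator co-design loop could be organized; the design spaces, the simulator and accuracy-predictor interfaces, and the scoring weights are all hypothetical placeholders, and a real framework would use trained surrogates and a smarter search rather than exhaustive enumeration.

```python
# Hypothetical sketch of a transformer-accelerator co-design loop in the
# spirit of TransCODE/ELECTOR. All interfaces and constants are illustrative
# assumptions, not the paper's actual framework.
import itertools
import random

TRANSFORMER_SPACE = {"layers": [4, 8, 12], "hidden": [256, 512, 768], "heads": [4, 8, 12]}
ACCELERATOR_SPACE = {"pe_array": [(16, 16), (32, 32)], "sram_kb": [256, 512, 1024]}


def estimate_accuracy(model_cfg):
    # Placeholder: a real flow would train the model or query a surrogate predictor.
    return random.uniform(0.80, 0.92)


def simulate_on_accelerator(model_cfg, hw_cfg):
    # Placeholder for a cycle-accurate simulator call; returns (latency, energy,
    # area) from a crude analytical model.
    flops = model_cfg["layers"] * model_cfg["hidden"] ** 2
    pes = hw_cfg["pe_array"][0] * hw_cfg["pe_array"][1]
    latency = flops / pes
    energy = flops * 1e-9 + hw_cfg["sram_kb"] * 1e-3
    area = pes * 1e-3 + hw_cfg["sram_kb"] * 1e-2
    return latency, energy, area


def score(model_cfg, hw_cfg):
    # Weighted objective: reward accuracy, penalize latency, energy, and area.
    acc = estimate_accuracy(model_cfg)
    lat, energy, area = simulate_on_accelerator(model_cfg, hw_cfg)
    return acc - 1e-8 * lat - 1e-2 * energy - 1e-3 * area


def expand(space):
    # Enumerate every configuration in a small dictionary-of-choices design space.
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


best_pair = max(
    ((m, h) for m in expand(TRANSFORMER_SPACE) for h in expand(ACCELERATOR_SPACE)),
    key=lambda pair: score(*pair),
)
print("best transformer-accelerator pair:", best_pair)
```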
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- HeTraX: Energy Efficient 3D Heterogeneous Manycore Architecture for Transformer Acceleration [18.355570259898]
We propose a three-dimensional heterogeneous architecture, referred to as HeTraX, specifically optimized to accelerate transformer models.
Experimental results show that HeTraX outperforms the existing state-of-the-art by up to 5.6x in speedup and improves EDP by 14.5x while ensuring thermal feasibility.
arXiv Detail & Related papers (2024-08-06T18:48:01Z)
- Accelerating ViT Inference on FPGA through Static and Dynamic Pruning [2.8595179027282907]
Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks.
Weight and token pruning are two well-known methods for reducing complexity.
We propose an algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous pruning.
arXiv Detail & Related papers (2024-03-21T00:09:04Z)
- Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures [5.46396577345121]
The complexity of transformer models in artificial intelligence increases their computational costs, memory usage, and energy consumption.
We propose a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, which effectively minimizes off-chip data access.
Our approach achieves up to a 2.8x speedup when executing inference with state-of-the-art transformers.
arXiv Detail & Related papers (2023-12-20T13:01:25Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers [6.0093441900032465]
Self-attention-based transformer models have achieved tremendous success in the domain of natural language processing.
Previous works directly operate on large matrices involved in the attention operation, which limits hardware utilization.
We propose a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead.
arXiv Detail & Related papers (2023-02-28T16:17:23Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency (see the N:M sparsity sketch after this list).
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- Data-Driven Offline Optimization For Architecting Hardware Accelerators [89.68870139177785]
We develop a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME.
PRIME improves performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively.
In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x.
arXiv Detail & Related papers (2021-10-20T17:06:09Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
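As referenced in the N:M sparse Transformers entry above, an N:M pattern keeps at most N nonzero weights in every group of M consecutive weights. The snippet below enforces a 2:4 pattern by magnitude; the shapes and pruning criterion are illustrative assumptions and are unrelated to the STA hardware or the IDP training procedure of that paper.

```python
# Hypothetical sketch of enforcing an N:M sparsity pattern (here 2:4) on a
# weight matrix by magnitude. Shapes and the criterion are assumptions.
import torch


def prune_n_m(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights along the last dimension; zero out the rest."""
    assert weight.shape[-1] % m == 0
    groups = weight.reshape(-1, m)                 # (num_groups, m)
    idx = groups.abs().topk(n, dim=-1).indices     # winners in each group
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(weight.shape)


w = torch.randn(768, 768)
w_sparse = prune_n_m(w, n=2, m=4)
# Every group of 4 consecutive weights now has at most 2 nonzeros.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=-1).max() <= 2
```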