Accelerating Framework of Transformer by Hardware Design and Model
Compression Co-Optimization
- URL: http://arxiv.org/abs/2110.10030v1
- Date: Tue, 19 Oct 2021 14:57:11 GMT
- Title: Accelerating Framework of Transformer by Hardware Design and Model
Compression Co-Optimization
- Authors: Panjie Qi, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Hongwu Peng, Shaoyi
Huang, Zhenglun Kong, Yuhong Song, and Bingbing Li
- Abstract summary: State-of-the-art Transformer-based models, with their very large parameter counts, are difficult to accommodate on resource-constrained embedded devices.
We propose an algorithm & hardware closed-loop acceleration framework to address the deployment challenge of Transformer.
Our framework can achieve 37x, 1.9x, 1.7x speedup compared to CPU, GPU and FPGA, respectively.
- Score: 3.5862583389869487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art Transformer-based models, with gigantic parameters, are
difficult to accommodate on resource-constrained embedded devices.
Moreover, with the development of technology, more and more embedded devices
are available to run a Transformer model. For a Transformer model with
different constraints (tight or loose), it can be deployed onto devices with
different computing power. However, in previous work, designers did not choose
the best device among multiple devices. Instead, they simply deployed the model
on an existing device, which was not necessarily the best fit and may lead to
underutilization of resources. To address the deployment challenge of
Transformer and the problem of selecting the best device, we propose an algorithm
& hardware closed-loop acceleration framework. Given a dataset, a model,
latency constraint LC and accuracy constraint AC, our framework can provide a
best device satisfying both constraints. In order to generate a compressed
model with high sparsity ratio, we propose a novel pruning technique,
hierarchical pruning (HP). We optimize the sparse matrix storage format for HP
matrix to further reduce memory usage for FPGA implementation. We design an
accelerator that takes advantage of HP to solve the problem of concurrent
random access. Experiments on Transformer and TinyBert models show that our
framework can find different devices for various LC and AC, covering from
low-end devices to high-end devices. Our HP can achieve higher sparsity ratio
and is more flexible than other sparsity patterns. Our framework can achieve
37x, 1.9x, 1.7x speedup compared to CPU, GPU and FPGA, respectively.
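
The device-selection step described in the abstract (given a latency constraint LC and an accuracy constraint AC, return a device that satisfies both) can be sketched as a simple feasibility filter. This is a minimal illustration and not the authors' closed-loop framework: the `Candidate` record, the `cost` field, the `select_device` function, and all numbers below are hypothetical, and the sketch assumes that "best" means the lowest-cost device that still meets both constraints, which is one reading of the abstract's concern about underutilized resources.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record: one device running one compressed model."""
    device: str
    cost: float        # assumed relative cost/capability rank (lower = lower-end)
    latency_ms: float  # assumed measured or estimated inference latency
    accuracy: float    # assumed accuracy of the compressed model


def select_device(
    candidates: Sequence[Candidate], lc_ms: float, ac: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate meeting both constraints, or None."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= lc_ms and c.accuracy >= ac
    ]
    if not feasible:
        return None
    # Assumption: prefer the cheapest feasible device; break ties on latency.
    return min(feasible, key=lambda c: (c.cost, c.latency_ms))


# Illustrative values only; none of these figures come from the paper.
candidates = [
    Candidate("low-end", cost=1.0, latency_ms=45.0, accuracy=0.90),
    Candidate("mid-range", cost=2.0, latency_ms=20.0, accuracy=0.91),
    Candidate("high-end", cost=4.0, latency_ms=8.0, accuracy=0.92),
]

loose = select_device(candidates, lc_ms=50.0, ac=0.89)  # loose constraint
tight = select_device(candidates, lc_ms=10.0, ac=0.89)  # tight constraint
print(loose.device if loose else None)
print(tight.device if tight else None)
```

With these made-up numbers, a loose latency constraint is met by the low-end device while a tight one forces the high-end device, mirroring the abstract's claim that different LC and AC settings map to different devices. In the paper's framework the latency and accuracy figures would come from the pruning and hardware-design loop rather than from a fixed table.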
Related papers
- SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices [72.0937240883345]
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment. We present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints.
arXiv Detail & Related papers (2026-01-13T07:46:46Z)
- MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception. High computational cost limits adoption on resource-constrained platforms. We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
- CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference [34.693462786320545]
CoFormer is a collaborative inference system for general transformer models. CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%.
arXiv Detail & Related papers (2025-08-28T02:50:12Z)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation [72.20660234882594]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z)
- SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving [7.91607650966469]
SLED is a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models. A single, shared edge server verifies the tokens utilizing a more precise target model. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits.
arXiv Detail & Related papers (2025-06-11T04:55:54Z)
- Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment [3.391499691517567]
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices.
We propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization.
Experimental results show our co-design achieves up to 2.14-49.37x throughput gains and 3.72-88.53x better energy efficiency over state-of-the-art Transformer accelerators.
arXiv Detail & Related papers (2024-07-16T12:36:10Z)
- ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers [38.310917646404576]
Memory constraint of always-on devices is one of the major concerns when deploying speech processing models.
We propose an approach named ResidualTransformer, where each weight matrix in a Transformer layer comprises 1) a shared full-rank component with its adjacent layers, and 2) a unique low-rank component to itself.
Experiments of our 10k-hour speech recognition and speech translation tasks show that the Transformer encoder size can be reduced by 3X with very slight performance degradation.
arXiv Detail & Related papers (2023-10-03T23:31:48Z)
- Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
- TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering [75.86788916930377]
The bilaterally slimmable Transformer (BST) can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z)
- LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models [34.673688610935876]
We show that the latency and perplexity Pareto frontier can be found without the need for any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
arXiv Detail & Related papers (2022-03-04T02:10:43Z)
- EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation [104.44478403427881]
EdgeFormer is a parameter-efficient Transformer of the encoder-decoder architecture for on-device seq2seq generation.
We conduct experiments on two practical on-device seq2seq tasks: Machine Translation and Grammatical Error Correction.
arXiv Detail & Related papers (2022-02-16T10:10:00Z)
- Vis-TOP: Visual Transformer Overlay Processor [9.80151619872144]
Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV).
We propose Vis-TOP, an overlay processor for various visual Transformer models.
Vis-TOP summarizes the characteristics of all visual Transformer models and implements a three-layer and two-level transformation structure.
arXiv Detail & Related papers (2021-10-21T08:11:12Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)