Related papers: Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures

Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures

URL: http://arxiv.org/abs/2501.09588v1
Date: Thu, 16 Jan 2025 15:11:33 GMT
Title: Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures
Authors: Pratyush Dhingra, Janardhan Rao Doppa, Partha Pratim Pande,
Abstract summary: We propose the design of a 3D heterogeneous architecture referred to as Atleus.<n>Atleus incorporates heterogeneous computing resources specifically optimized to accelerate transformer models.<n>We show that Atleus outperforms existing state-of-the-art by up to 56x and 64.5x in terms of performance and energy efficiency respectively.
Score: 18.355570259898
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer architectures have become the standard neural network model for various machine learning applications including natural language processing and computer vision. However, the compute and memory requirements introduced by transformer models make them challenging to adopt for edge applications. Furthermore, fine-tuning pre-trained transformers (e.g., foundation models) is a common task to enhance the model's predictive performance on specific tasks/applications. Existing transformer accelerators are oblivious to complexities introduced by fine-tuning. In this paper, we propose the design of a three-dimensional (3D) heterogeneous architecture referred to as Atleus that incorporates heterogeneous computing resources specifically optimized to accelerate transformer models for the dual purposes of fine-tuning and inference. Specifically, Atleus utilizes non-volatile memory and systolic array for accelerating transformer computational kernels using an integrated 3D platform. Moreover, we design a suitable NoC to achieve high performance and energy efficiency. Finally, Atleus adopts an effective quantization scheme to support model compression. Experimental results demonstrate that Atleus outperforms existing state-of-the-art by up to 56x and 64.5x in terms of performance and energy efficiency respectively

Related papers

SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices [72.0937240883345]
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment.<n>We present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints.
arXiv Detail & Related papers (2026-01-13T07:46:46Z)
CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference [34.693462786320545]
CoFormer is a collaborative inference system for general transformer models.<n>CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%.
arXiv Detail & Related papers (2025-08-28T02:50:12Z)
Spiking Transformer Hardware Accelerators in 3D Integration [5.426379844893919]
Spiking neural networks (SNNs) are powerful models of computation and are well suited for on resource-constrained edge devices and neuromorphic hardware. Recently emerged showcased spiking transformers have promising performance and efficiency by capitalizing on the binary nature of spiking operations.
arXiv Detail & Related papers (2024-11-11T22:08:11Z)
HeTraX: Energy Efficient 3D Heterogeneous Manycore Architecture for Transformer Acceleration [18.355570259898]
We propose the design of a three-dimensional heterogeneous architecture referred to as HeTraX specifically optimized to accelerate transformer models. Experimental results show that HeTraX outperforms existing state-of-the-art by up to 5.6x in speedup and improves EDP by 14.5x while ensuring thermally feasibility.
arXiv Detail & Related papers (2024-08-06T18:48:01Z)
Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment [3.391499691517567]
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. We propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization. Experimental results show our co-design achieves up to 2.14-49.37x throughput gains and 3.72-88.53x better energy efficiency over state-of-the-art Transformer accelerators.
arXiv Detail & Related papers (2024-07-16T12:36:10Z)
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search [74.24393546346974]
Vision transformers have shown unprecedented levels of performance in tackling various visual perception tasks in recent years. There has been significant research recently on the design of efficient vision transformer architecture. In this study, we explore the generation of fast vision transformer architecture designs via generative architecture search.
arXiv Detail & Related papers (2023-08-22T13:08:29Z)
TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models. The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications. The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate. There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
Vision Transformer Computation and Resilience for Dynamic Inference [3.6929360462568077]
We leverage the resilience of vision transformers to pruning and switch between different scaled versions of a model. Most FLOPs are generated by convolutions, not attention. Some models are fairly resilient and their model execution can be adapted without retraining.
arXiv Detail & Related papers (2022-12-06T01:10:31Z)
Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks. Although the network performance is boosted, transformers are often required more computational resources. We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models. We prove that eliminating the MASK token and considering the whole output during the loss are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers. This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.