COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference
- URL: http://arxiv.org/abs/2504.16269v2
- Date: Thu, 24 Apr 2025 18:13:19 GMT
- Title: COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference
- Authors: Ye Qiao, Zhiheng Chen, Yian Wang, Yifan Zhang, Yunzhe Deng, Sitao Huang
- Abstract summary: We introduce COBRA, an algorithm-architecture co-optimized binary Transformer accelerator for edge computing. COBRA features a real 1-bit binary multiplication unit, enabling matrix operations with -1, 0, and +1 values, surpassing ternary methods. With further hardware-friendly optimizations in the attention block, COBRA achieves up to 3,894.7 GOPS throughput and 448.7 GOPS/Watt energy efficiency on edge FPGAs.
- Score: 5.55799327417722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have demonstrated superior performance in various fields, including natural language processing and computer vision. However, their enormous model size and high demands in computation, memory, and communication hinder their deployment on edge platforms for local, secure inference. Binary transformers offer a compact, low-complexity solution for edge deployment with reduced bandwidth needs and acceptable accuracy. However, existing binary transformers perform inefficiently on current hardware due to the lack of binary-specific optimizations. To address this, we introduce COBRA, an algorithm-architecture co-optimized binary Transformer accelerator for edge computing. COBRA features a real 1-bit binary multiplication unit, enabling matrix operations with -1, 0, and +1 values, surpassing ternary methods. With further hardware-friendly optimizations in the attention block, COBRA achieves up to 3,894.7 GOPS throughput and 448.7 GOPS/Watt energy efficiency on edge FPGAs, delivering a 311x energy efficiency improvement over GPUs and a 3.5x throughput improvement over the state-of-the-art binary accelerator, with only negligible inference accuracy degradation.
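To make the arithmetic concrete: when both operands take values in {-1, 0, +1}, a dot product can be computed with nothing more than XNOR, masking, and popcounts. The NumPy sketch below illustrates that idea under an assumed sign/nonzero bit-pair encoding; it is not COBRA's actual encoding or hardware datapath.

```python
# A minimal sketch of a {-1, 0, +1} dot product using only 1-bit logic
# (XNOR + popcount). The sign/nonzero encoding below is an illustrative
# assumption, not the encoding used by the COBRA accelerator.
import numpy as np

def encode_ternary(v):
    """Split a {-1, 0, +1} vector into a sign bit-plane and a nonzero mask."""
    v = np.asarray(v, dtype=np.int8)
    sign = (v > 0).astype(np.uint8)      # 1 encodes +1, 0 encodes -1 (ignored when masked out)
    nonzero = (v != 0).astype(np.uint8)  # 1 where the value is nonzero
    return sign, nonzero

def ternary_dot(a, b):
    """Dot product of two ternary vectors via XNOR, masking, and popcounts."""
    sa, ma = encode_ternary(a)
    sb, mb = encode_ternary(b)
    mask = ma & mb                   # positions where both operands are nonzero
    agree = (~(sa ^ sb) & 1) & mask  # XNOR of sign bits -> 1 where the product is +1
    pos = int(agree.sum())           # popcount of +1 products
    total = int(mask.sum())          # popcount of all contributing positions
    return 2 * pos - total           # (+1 count) - (-1 count)

a = np.random.choice([-1, 0, 1], size=64)
b = np.random.choice([-1, 0, 1], size=64)
assert ternary_dot(a, b) == int(np.dot(a, b))
```

In hardware, the two popcounts reduce to narrow adder trees over single-bit products, which is the kind of structure a genuinely 1-bit multiplication unit can exploit.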
Related papers
- Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous-generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z) - Leveraging ASIC AI Chips for Homomorphic Encryption [12.209134343914537]
Homomorphic encryption (HE) offers strong privacy guarantees, but it requires substantially more resources than computing on plaintext. Accelerators have emerged to mitigate this latency issue, but at the high cost of ASICs. We show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud.
arXiv Detail & Related papers (2025-01-13T04:08:14Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment [3.391499691517567]
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices.
We propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization.
Experimental results show our co-design achieves up to 2.14-49.37x throughput gains and 3.72-88.53x better energy efficiency over state-of-the-art Transformer accelerators.
arXiv Detail & Related papers (2024-07-16T12:36:10Z) - BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
arXiv Detail & Related papers (2024-05-27T10:44:05Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone encoder on device and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, the N:M sparse Transformers generated with IDP achieve an average 6.7% accuracy improvement with high training efficiency (a short sketch of the N:M sparsity pattern appears after this list).
arXiv Detail & Related papers (2022-08-12T04:51:49Z) - Easy and Efficient Transformer: Scalable Inference Solution For large NLP model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine, Easy and Efficient Transformer (EET), is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)
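As a side note on the N:M sparse Transformer entry above: an N:M sparsity pattern keeps at most N nonzero weights in every group of M consecutive weights, which gives the accelerator a fixed, predictable amount of work per group. The sketch below enforces a 2:4 pattern by magnitude; magnitude-based selection is a common heuristic used here purely for illustration and is not the IDP procedure described in that paper.

```python
# Minimal sketch of enforcing an N:M structured sparsity pattern (here 2:4):
# within every group of m consecutive weights along a row, keep only the n
# largest-magnitude entries and zero the rest. Illustrative only; the paper's
# IDP training procedure is not reproduced here.
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Zero out all but the n largest-magnitude values in each group of m weights."""
    w = np.asarray(weights, dtype=np.float32)
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of m"
    groups = w.reshape(rows, cols // m, m)          # split each row into groups of m
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)   # clear the dropped positions
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(4, 16).astype(np.float32)
w_sparse = prune_n_m(w, n=2, m=4)
# every group of 4 consecutive weights now holds at most 2 nonzeros
assert (np.count_nonzero(w_sparse.reshape(4, 4, 4), axis=-1) <= 2).all()
```

Because every group carries a bounded number of nonzeros, hardware can skip the zeroed positions with a fixed schedule rather than handling arbitrary unstructured sparsity.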