BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
- URL: http://arxiv.org/abs/2512.06457v1
- Date: Sat, 06 Dec 2025 14:44:38 GMT
- Title: BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination
- Authors: Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin,
- Abstract summary: BitStopper is a fine-grained algorithm-architecture co-design that operates without a sparsity predictor.<n>It achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
- Score: 14.53308613746613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enable stage fusion (BESF) mechanism is proposed to reuse and minimize the memory access by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with the bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy is employed to improve compute utilization during the on-demand bit-grained memory fetching. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
Related papers
- ADEPT: Adaptive Dynamic Early-Exit Process for Transformers [12.23755727319088]
Early-exit strategies have proven effective in reducing computational demands by halting inference earlier.<n>We introduce ADEPT, a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases.<n>We show that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
arXiv Detail & Related papers (2026-01-07T08:34:41Z) - LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model [14.53308613746613]
This paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA.<n>Results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works Spatten, Sanger and FACT.
arXiv Detail & Related papers (2025-11-26T07:24:48Z) - Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost.<n>This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation.<n> Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z) - Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing [8.705453442427585]
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks.<n>Their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding.<n>This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices.
arXiv Detail & Related papers (2025-11-06T02:55:07Z) - InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation [56.694702609077495]
Long-sequence processing is a critical capability for modern large language models.<n>InfLLM-V2 is a trainable sparse attention framework that seamlessly adapts models from short to long sequences.<n>In experiments, InfLLM-V2 is 4$times$ faster than dense attention while retaining 98.1% and 99.7% of the performance.
arXiv Detail & Related papers (2025-09-29T12:08:33Z) - Spark Transformer: Reactivating Sparsity in FFN and Attention [53.221448818147024]
We introduce Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism.<n>This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
arXiv Detail & Related papers (2025-06-07T03:51:13Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Statefuls (e.g., Adam) maintain auxiliary information even 2x the model size in order to achieve optimal convergence.<n>SOLO enables Adam-styles to maintain quantized states with precision as low as 3 bits, or even 2 bits.<n>SOLO can thus be seamlessly applied to Adam-styles, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Easy and Efficient Transformer : Scalable Inference Solution For large
NLP mode [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x state-of-art speedup varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM)
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.