Related papers: IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

URL: http://arxiv.org/abs/2511.21513v1
Date: Wed, 26 Nov 2025 15:46:22 GMT
Title: IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
Authors: Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu,
Abstract summary: We present IntAttention, the first fully integer, plug-and-play attention pipeline without retraining. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization. Our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines and runs 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs.
Score: 11.526305104815357
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly dequantize-softmax-requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer, plug-and-play attention pipeline without retraining. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate IntAttention and demonstrate consistent and substantial gains. Our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines and runs 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains are achieved with high-fidelity accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices. Code will be released in a later version of this work.
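
The three ingredients named in the abstract (clipping, a 32-entry lookup table, integer normalization) can be illustrated with a minimal Python sketch. This is not the authors' IndexSoftmax, whose details are not given here. The function name `lut_softmax_int`, the clipping range `CLIP_REAL`, the uint8 output scale `OUT_MAX`, and the parameter `clip_q` (the clipping range expressed in quantized logit units) are all assumptions made for illustration.

```python
import numpy as np

LUT_SIZE = 32     # table length named in the abstract
CLIP_REAL = 8.0   # assumed clipping range in real units; exp(-8) is about 3e-4
OUT_MAX = 255     # assumed output scale: probability p is stored as round(255 * p)

# Built once, offline. Floating point is used only to fill the table.
# Entry i approximates OUT_MAX * exp(-i * CLIP_REAL / (LUT_SIZE - 1)).
EXP_LUT = np.round(
    OUT_MAX * np.exp(-np.arange(LUT_SIZE) * CLIP_REAL / (LUT_SIZE - 1))
).astype(np.int64)


def lut_softmax_int(logits_q: np.ndarray, clip_q: int) -> np.ndarray:
    """Row-wise softmax over integer logits using only integer arithmetic.

    logits_q: integer attention scores (hypothetical quantized Q @ K^T).
    clip_q:   CLIP_REAL divided by the logit scale, rounded to an int >= 1.
    """
    x = logits_q.astype(np.int64)
    # Distance below the row maximum is non-negative, so exp(-d) <= 1.
    d = x.max(axis=-1, keepdims=True) - x
    # Clipping: anything further than clip_q below the max shares the last
    # table entry, which is 0 for the constants chosen above.
    d = np.minimum(d, clip_q)
    # Map [0, clip_q] onto table indices [0, LUT_SIZE - 1], rounding to nearest.
    idx = (d * (LUT_SIZE - 1) + clip_q // 2) // clip_q
    e = EXP_LUT[idx]
    # Integer normalization; the row maximum contributes OUT_MAX, so s > 0.
    s = e.sum(axis=-1, keepdims=True)
    return ((e * OUT_MAX + s // 2) // s).astype(np.uint8)


# Example: logit scale 0.05, so clip_q = round(8.0 / 0.05) = 160.
scores = np.array([[100, 20, -50, 0], [0, 0, 0, 0]], dtype=np.int32)
probs = lut_softmax_int(scores, clip_q=160)
```

The output stays in uint8, so under these assumptions it could feed a following integer matrix multiplication without a dequantize step. A coarser table trades accuracy for a smaller footprint; the 32-entry size is the only constant here taken from the abstract.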

Related papers

BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling [12.43240392025487]
We introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax. Our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss. Our work paves the way for doubling end-to-end inference throughput without increasing chip area.
arXiv Detail & Related papers (2026-02-02T13:12:18Z)
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
FLASH-D: FlashAttention with Hidden Softmax Division [3.668018928502405]
Building on online softmax computation, FlashAttention integrates softmax calculation with matrix arithmetic. This work presents FLASH-D, a mathematically equivalent, yet simplified, formulation that achieves: (a) hiding softmax division within other non-linear function evaluations; (b) inherently numerically stable computation of exponentials; and (c) a reduction in computational cost without introducing numerical approximations to the FlashAttention kernel.
arXiv Detail & Related papers (2025-05-20T11:01:33Z)
Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction [52.14200610448542]
A transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. We propose a simple, novel, and effective procedure for correcting this distributional shift. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.
arXiv Detail & Related papers (2025-05-16T13:48:33Z)
VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers [13.984340807378457]
Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. We design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method. We execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster.
arXiv Detail & Related papers (2025-04-15T14:28:48Z)
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers. A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference [3.067607520161916]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. These models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. We propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting.
arXiv Detail & Related papers (2022-07-04T13:37:38Z)
I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer based models. I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.