QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
- URL: http://arxiv.org/abs/2511.06767v1
- Date: Mon, 10 Nov 2025 06:46:21 GMT
- Title: QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations
- Authors: Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan
- Abstract summary: QUARK is a quantization-enabled FPGA acceleration framework. It targets all nonlinear operations within Transformer-based models. It achieves high-performance approximation through a novel circuit-sharing design.
- Score: 16.476647190730876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, the nonlinear operations in these models contribute significantly to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to a 1.96x end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches, while maintaining high model accuracy and even substantially boosting accuracy under ultra-low-bit quantization.
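The abstract's central idea is that softmax, GELU, and layer normalization share common computational patterns, so one approximation datapath can serve all of them by swapping small per-operator coefficient tables. The Python sketch below illustrates that general idea only; the segment counts, fitting ranges, helper names, and use of floating point (rather than QUARK's quantized FPGA datapath) are assumptions for illustration, not the paper's actual design.

```python
# A minimal, floating-point sketch of the circuit-sharing idea: one piecewise-
# linear evaluation "datapath" is reused by softmax, GELU, and LayerNorm, and
# only a small per-operator coefficient table changes. All ranges and segment
# counts are illustrative assumptions, not QUARK's design.
import numpy as np

def build_pwl_table(fn, lo, hi, segments=16):
    """Precompute per-segment slope/intercept pairs approximating fn on [lo, hi]."""
    xs = np.linspace(lo, hi, segments + 1)
    slopes = (fn(xs[1:]) - fn(xs[:-1])) / (xs[1:] - xs[:-1])
    intercepts = fn(xs[:-1]) - slopes * xs[:-1]
    return xs[:-1], slopes, intercepts, lo, hi

def shared_pwl_eval(x, table):
    """The shared datapath: clamp, segment lookup, one multiply-add."""
    breaks, slopes, intercepts, lo, hi = table
    xc = np.clip(x, lo, hi)
    idx = np.clip(np.searchsorted(breaks, xc, side="right") - 1, 0, len(breaks) - 1)
    return slopes[idx] * xc + intercepts[idx]

def gelu_ref(v):  # tanh approximation of GELU, used only to fit the table
    return 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v ** 3)))

# Per-operator coefficient tables: the only part that differs between operators.
TABLES = {
    "exp":   build_pwl_table(np.exp, -8.0, 0.0),                       # softmax
    "gelu":  build_pwl_table(gelu_ref, -4.0, 4.0),                     # GELU
    "rsqrt": build_pwl_table(lambda v: 1.0 / np.sqrt(v), 0.25, 16.0),  # LayerNorm
}

def softmax(x):
    e = shared_pwl_eval(x - x.max(axis=-1, keepdims=True), TABLES["exp"])
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    return shared_pwl_eval(x, TABLES["gelu"])

def layernorm(x, eps=1e-5):
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) * shared_pwl_eval(var + eps, TABLES["rsqrt"])

x = np.random.randn(2, 8).astype(np.float32)
exact = np.exp(x - x.max(-1, keepdims=True)); exact /= exact.sum(-1, keepdims=True)
print("softmax max abs error:", np.abs(softmax(x) - exact).max())
```

On actual hardware the analogous multiply-add path would be time-shared across the operators and fed quantized inputs, so only the small coefficient memories are duplicated per operator.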
Related papers
- POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation [57.57816409869894]
We introduce POET-X, a scalable and memory-efficient variant of POET for training large language models. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency.
arXiv Detail & Related papers (2026-03-05T18:59:23Z)
- TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design [4.251127523213289]
TriGen is a novel NPU architecture tailored for resource-constrained environments through software-hardware co-design. We show that TriGen achieves an average 2.73x performance speedup and 52% less memory transfer over the baseline NPU design with negligible accuracy loss.
arXiv Detail & Related papers (2026-02-13T14:28:31Z)
- Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer [0.0]
Transformer-based large language models rely heavily on matrix multiplications for attention and feed-forward layers. We introduce a highly optimized tiled matrix multiplication accelerator on a resource-constrained Xilinx KV260 FPGA. Our design exploits persistent on-chip storage, a robust two-level tiling strategy for maximal data reuse, and a systolic-like unrolled compute engine.
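The two-level tiling described above can be sketched in software: an outer tile level stages blocks into on-chip buffers for reuse, and an inner level feeds the unrolled compute engine. The tile sizes and loop order below are illustrative assumptions, not the accelerator's actual configuration.

```python
# Illustrative two-level tiled matrix multiply. Outer tiles model blocks staged
# into on-chip buffers; inner tiles model the unrolled compute engine working
# on data already resident on chip. Tile sizes are assumptions.
import numpy as np

def tiled_matmul(A, B, outer=64, inner=8):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i0 in range(0, M, outer):                      # stage an output block
        for j0 in range(0, N, outer):
            for k0 in range(0, K, outer):              # stream K in outer chunks
                a_blk = A[i0:i0 + outer, k0:k0 + outer]  # "on-chip" copies
                b_blk = B[k0:k0 + outer, j0:j0 + outer]
                for i1 in range(0, a_blk.shape[0], inner):      # inner compute tiles
                    for j1 in range(0, b_blk.shape[1], inner):
                        blk = a_blk[i1:i1 + inner, :] @ b_blk[:, j1:j1 + inner]
                        C[i0 + i1:i0 + i1 + blk.shape[0],
                          j0 + j1:j0 + j1 + blk.shape[1]] += blk
    return C

A = np.random.randn(128, 96).astype(np.float32)
B = np.random.randn(96, 160).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```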
arXiv Detail & Related papers (2025-03-20T22:15:42Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Dynamic Range Reduction via Branch-and-Bound [1.0141085397402314]
A key strategy for enhancing hardware accelerators is reducing the precision of arithmetic operations. This paper introduces a fully principled Branch-and-Bound algorithm for reducing precision needs in QUBO problems. Experiments validate our algorithm's effectiveness on an actual quantum annealer.
arXiv Detail & Related papers (2024-09-17T03:07:56Z)
- Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA [20.629635991749808]
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs.
At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads.
At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
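As a rough illustration of the algorithm-level idea, the toy sketch below runs a few Monte-Carlo dropout passes through a small MLP and stops at an early exit when the averaged prediction is already confident, so only uncertain inputs pay for the deeper block. The layer sizes, dropout rate, entropy threshold, and sample count are assumptions for illustration, not the paper's configuration.

```python
# Toy multi-exit Monte-Carlo-dropout sketch: average a few stochastic passes
# through block 1 and its early-exit head; only uncertain inputs run block 2.
# All sizes, rates, and thresholds are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 16, 32, 4                                   # input dim, hidden dim, classes
W1, W2 = rng.normal(0, 0.3, (D, H)), rng.normal(0, 0.3, (H, H))
HEAD1, HEAD2 = rng.normal(0, 0.3, (H, C)), rng.normal(0, 0.3, (H, C))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dropout(h, p=0.1):
    return h * (rng.random(h.shape) > p) / (1 - p)    # MC dropout stays on at inference

def predict(x, n_samples=8, entropy_thresh=0.5):
    # Stage 1: block 1 plus early-exit head, sampled n_samples times.
    h1s = [dropout(np.maximum(x @ W1, 0)) for _ in range(n_samples)]
    p1 = np.mean([softmax(h @ HEAD1) for h in h1s], axis=0)
    if -(p1 * np.log(p1 + 1e-9)).sum() < entropy_thresh:
        return p1, "early exit"                       # confident: skip block 2
    # Stage 2: only uncertain inputs run the deeper block.
    p2 = np.mean([softmax(dropout(np.maximum(h @ W2, 0)) @ HEAD2) for h in h1s], axis=0)
    return p2, "final exit"

print(predict(rng.normal(size=D)))
```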
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
- ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation [2.7488316163114823]
This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations.
Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix.
On a 16$\times$16 crossbar, for 8-bit input processing, the proposed approach achieves an energy efficiency of 1602 tera operations per second per watt.
arXiv Detail & Related papers (2023-09-04T19:19:39Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
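A heavily simplified sketch of the general idea behind such architectures, keeping a decayed key-value state that is updated one token at a time so per-token inference cost stays constant, is shown below. It is not the exact RWKV WKV/time-mixing formulation; the decay scheme and weighting are illustrative assumptions.

```python
# Simplified "attention as a recurrence" sketch in the spirit of RWKV-style
# models: a decayed key-value summary updated token by token, giving constant
# per-token inference cost. Not the exact RWKV formulation.
import numpy as np

def recurrent_mix(keys, values, decay=0.9):
    d_k, d_v = keys.shape[1], values.shape[1]
    num = np.zeros((d_k, d_v))                  # decayed sum of weighted values
    den = np.zeros(d_k)                         # decayed sum of weights (normaliser)
    outs = []
    for k_t, v_t in zip(keys, values):
        w = np.exp(k_t)                         # positive per-channel weights
        num = decay * num + np.outer(w, v_t)    # state update, independent of sequence length
        den = decay * den + w
        outs.append((w @ num) / (w @ den + 1e-9))
    return np.array(outs)                       # one output per token

x = np.random.randn(6, 8)
print(recurrent_mix(x, x).shape)                # (6, 8)
```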
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z)
- GPU-Accelerated Machine Learning in Non-Orthogonal Multiple Access [71.58925117604039]
Non-orthogonal multiple access (NOMA) is a promising technology that enables the massive connectivity required in future 5G and 6G networks.
We propose a neural network architecture that combines the advantages of both linear and non-linear processing.
arXiv Detail & Related papers (2022-06-13T09:38:23Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference [9.329021390526124]
Non-linear operations such as GELU, Layer normalization, and Softmax are essential yet costly building blocks of Transformer models.
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference.
arXiv Detail & Related papers (2021-12-03T23:06:57Z)
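The NN-LUT entry above describes replacing GELU, layer normalization, and Softmax with cheap, hardware-friendly approximations. The sketch below shows one simplified instance of that idea for GELU, fitting a tiny random-feature ReLU network in closed form; the layer size, fitting range, and least-squares procedure are my own simplifications, not the NN-LUT method itself.

```python
# Simplified instance of "approximate a nonlinear op with a tiny network":
# fit a one-hidden-layer ReLU network to GELU in closed form, then reuse it as
# a cheap surrogate. All sizes and ranges are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):  # reference operator to approximate (tanh form)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

hidden = 32
w1 = rng.uniform(-2, 2, hidden)                         # frozen random first layer
b1 = rng.uniform(-4, 4, hidden)

xs = np.linspace(-6, 6, 2048)                           # fitting grid
feats = np.maximum(np.outer(xs, w1) + b1, 0)            # (2048, hidden) ReLU features
w2, *_ = np.linalg.lstsq(feats, gelu(xs), rcond=None)   # closed-form output layer

def nn_gelu(x):
    """Cheap surrogate: one ReLU layer plus a dot product."""
    return np.maximum(np.outer(x, w1) + b1, 0) @ w2

test = np.linspace(-6, 6, 1000)
print("max abs error:", np.abs(nn_gelu(test) - gelu(test)).max())
```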
This list is automatically generated from the titles and abstracts of the papers on this site.