BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs
- URL: http://arxiv.org/abs/2507.09780v1
- Date: Sun, 13 Jul 2025 20:27:27 GMT
- Title: BitParticle: Partializing Sparse Dual-Factors to Build Quasi-Synchronizing MAC Arrays for Energy-efficient DNNs
- Authors: Feilong Qiaoyuan, Jihe Wang, Zhiyu Sun, Linying Wu, Yuanhua Xiao, Danghui Wang
- Abstract summary: Bit-level sparsity in quantized deep neural networks (DNNs) offers significant potential for optimizing Multiply-Accumulate (MAC) operations. However, two key challenges still limit its practical exploitation. First, conventional bit-serial approaches cannot simultaneously leverage the sparsity of both factors. Second, the fluctuation of bit-level sparsity leads to variable cycle counts for MAC operations.
- Score: 1.5079304866622987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bit-level sparsity in quantized deep neural networks (DNNs) offers significant potential for optimizing Multiply-Accumulate (MAC) operations. However, two key challenges still limit its practical exploitation. First, conventional bit-serial approaches cannot simultaneously leverage the sparsity of both factors, leading to a complete waste of one factor's sparsity. Methods designed to exploit dual-factor sparsity are still in the early stages of exploration, facing the challenge of partial product explosion. Second, the fluctuation of bit-level sparsity leads to variable cycle counts for MAC operations. Existing synchronous scheduling schemes that are suitable for dual-factor sparsity exhibit poor flexibility and still result in significant underutilization of MAC units. To address the first challenge, this study proposes a MAC unit that leverages dual-factor sparsity through the emerging particlization-based approach. The proposed design addresses the issue of partial product explosion through simple control logic, resulting in a more area- and energy-efficient MAC unit. In addition, by discarding less significant intermediate results, the design allows for further hardware simplification at the cost of minor accuracy loss. To address the second challenge, a quasi-synchronous scheme is introduced that adds cycle-level elasticity to the MAC array, reducing pipeline stalls and thereby improving MAC unit utilization. Evaluation results show that the exact version of the proposed MAC array architecture achieves a 29.2% improvement in area efficiency compared to the state-of-the-art bit-sparsity-driven architecture, while maintaining comparable energy efficiency. The approximate variant further improves energy efficiency by 7.5% compared to the exact version. Index Terms: DNN acceleration, Bit-level sparsity, MAC unit
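To make "dual-factor bit-level sparsity" concrete, here is a minimal Python sketch. It is not the paper's MAC unit. It assumes unsigned 8-bit magnitudes, and the helper names (`popcount`, `nonzero_partial_products`, `sparse_multiply`) are hypothetical. It shows that a product needs only one partial product per pair of set bits, so the amount of work varies with the sparsity of both factors.

```python
# Illustrative sketch only: unsigned 8-bit operands are an assumption,
# and none of these helpers come from the paper.

def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def nonzero_partial_products(a: int, w: int) -> int:
    """Count the bit-pair partial products of a * w that are nonzero.

    Each set bit of a pairs with each set bit of w, so the count is
    popcount(a) * popcount(w).
    """
    return popcount(a) * popcount(w)


def sparse_multiply(a: int, w: int, bits: int = 8) -> int:
    """Multiply by summing only the nonzero bit-pair partial products."""
    total = 0
    for i in range(bits):
        if not (a >> i) & 1:
            continue  # zero bit in the first factor: skip the whole row
        for j in range(bits):
            if (w >> j) & 1:  # zero bits in the second factor are skipped too
                total += 1 << (i + j)
    return total


pairs = [(0b00010010, 0b01000001), (0b01110111, 0b00111101), (0, 0b11111111)]
for a, w in pairs:
    assert sparse_multiply(a, w) == a * w
    print(f"{a:08b} * {w:08b}: {nonzero_partial_products(a, w)} of 64 partial products")
```

The count ranges from 0 to 64 for 8-bit operands, which is one way to picture why per-operation cycle counts fluctuate when a MAC unit skips zero bits in both factors.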
Related papers
- Energy-Efficient Supervised Learning with a Binary Stochastic Forward-Forward Algorithm [0.0]
We derive forward-forward algorithms for binary units. We evaluate our proposed algorithms on the MNIST, Fashion-MNIST, and CIFAR-10 datasets.
arXiv Detail & Related papers (2025-07-09T00:29:06Z) - MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature [7.512116180634991]
Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. We analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. We propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning.
arXiv Detail & Related papers (2025-06-10T05:38:04Z) - R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - DOMAC: Differentiable Optimization for High-Speed Multipliers and Multiply-Accumulators [25.876084896293058]
DOMAC is a novel approach that employs differentiable optimization for designing multipliers and MACs at specific technology nodes. Building on this insight, DOMAC reformulates the discrete optimization challenge into a continuous problem by incorporating differentiable timing and area objectives.
arXiv Detail & Related papers (2025-03-31T10:49:05Z) - Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. It consists of multiple waveguides, which equip numerous low-cost antennas, named pinching antennas (PAs). The positions of PAs can be reconfigured to both spanning large-scale path and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z) - USEFUSE: Uniform Stride for Enhanced Performance in Fused Layer Architecture of Deep Neural Networks [0.6435156676256051]
This study presents the Sum-of-Products (SOP) units for convolution, which utilize low-latency left-to-right bit-serial arithmetic. An effective mechanism detects and skips inefficient convolutions after ReLU layers, minimizing power consumption. Two designs cater to varied demands: one focuses on minimal response time for mission-critical applications, and another focuses on resource-constrained devices with comparable latency.
arXiv Detail & Related papers (2024-12-18T11:04:58Z) - MixPE: Quantization and Hardware Co-design for Efficient LLM Inference [16.42907854119748]
MixPE is a specialized mixed-precision processing element designed for efficient low-bit quantization in large language models.
We show that MixPE surpasses the state-of-the-art quantization accelerators by $2.6\times$ speedup and $1.4\times$ energy reduction.
arXiv Detail & Related papers (2024-11-25T07:34:53Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z) - MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using DRAM Technology [2.918940961856197]
This paper presents MAC-DO, an efficient and low-power DRAM-based in-situ accelerator.
It supports a multi-bit multiply-accumulate (MAC) operation within a single cycle.
A MAC-DO array can efficiently accelerate matrix multiplications based on output stationary mapping, supporting the majority of computations performed in deep neural networks (DNNs).
arXiv Detail & Related papers (2022-07-16T07:33:20Z) - Multiple Kernel Clustering with Dual Noise Minimization [56.009011016367744]
Multiple kernel clustering (MKC) aims to group data by integrating complementary information from base kernels.
In this paper, we rigorously define dual noise and propose a novel parameter-free MKC algorithm by minimizing them.
We observe that dual noise will pollute the block diagonal structures and incur the degeneration of clustering performance, and C-noise exhibits stronger destruction than N-noise.
arXiv Detail & Related papers (2022-07-13T08:37:42Z) - Learning Efficient GANs for Image Translation via Differentiable Masks and co-Attention Distillation [130.30465659190773]
Generative Adversarial Networks (GANs) have been widely used in image translation, but their high computation and storage costs impede their deployment on mobile devices.
We introduce a novel GAN compression method, termed DMAD, by proposing a Differentiable Mask and a co-Attention Distillation.
Experiments show DMAD can reduce the Multiply Accumulate Operations (MACs) of CycleGAN by 13x and that of Pix2Pix by 4x while retaining a comparable performance against the full model.
arXiv Detail & Related papers (2020-11-17T02:39:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.