SWIS -- Shared Weight bIt Sparsity for Efficient Neural Network
Acceleration
- URL: http://arxiv.org/abs/2103.01308v2
- Date: Wed, 3 Mar 2021 03:35:58 GMT
- Title: SWIS -- Shared Weight bIt Sparsity for Efficient Neural Network
Acceleration
- Authors: Shurui Li, Wojciech Romaszkan, Alexander Graening, Puneet Gupta
- Abstract summary: Quantization is spearheading the increase in performance and efficiency of neural network computing systems.
We present SWIS - Shared Weight bIt Sparsity, a quantization framework for efficient neural network inference acceleration.
- Score: 68.36996813591423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is spearheading the increase in performance and
efficiency of neural network computing systems, making headway into
commodity hardware. We present SWIS - Shared Weight bIt Sparsity, a
quantization framework for efficient neural network inference acceleration
that delivers improved performance and storage compression through an
offline weight decomposition and scheduling algorithm. SWIS can achieve up
to 54.3% (19.8%) point accuracy improvement compared to weight truncation
when quantizing MobileNet-v2 to 4 (2) bits post-training (with retraining),
showing the strength of leveraging shared bit sparsity in weights. The SWIS
accelerator gives up to 6x speedup and 1.9x energy improvement over
state-of-the-art bit-serial architectures.
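The abstract's key mechanism is an offline decomposition of quantized weights into bit-slices that share a common sparse set of bit positions, so a bit-serial datapath only cycles over the slices that are kept. Below is a minimal, illustrative sketch of that idea (my own construction, using a naive magnitude-based slice selection; the paper's scheduling algorithm is more sophisticated):

```python
import numpy as np

def swis_decompose(weights, total_bits=8, kept_slices=4):
    # Quantize to signed fixed point.
    scale = np.max(np.abs(weights)) / (2 ** (total_bits - 1) - 1)
    q = np.round(weights / scale).astype(np.int32)
    signs, mags = np.sign(q), np.abs(q)

    # Bit-plane decomposition: mags == sum_b planes[b] * 2**b.
    planes = np.stack([(mags >> b) & 1 for b in range(total_bits)])

    # Keep the bit positions contributing the most total magnitude across
    # the whole group -- every weight then shares the same sparse bit set,
    # so a bit-serial processing element cycles over only `kept_slices`
    # positions instead of `total_bits`.
    contrib = planes.sum(axis=1) * 2 ** np.arange(total_bits)
    shared = np.sort(np.argsort(contrib)[-kept_slices:])

    approx = sum(planes[b] << b for b in shared)
    return signs * approx * scale, shared

w = np.random.randn(64) * 0.2
w_hat, shared_bits = swis_decompose(w)
print(shared_bits, np.abs(w - w_hat).max())
```

Since a bit-serial accelerator's cycle count scales with the number of weight bit-slices it processes, sharing a small slice set across a group of weights is what the reported speedup and storage compression rest on.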
Related papers
- Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators [19.034502382765755]
We introduce a novel binary weight multi-bit activation (BWMA) method for CNNs on CIM-based accelerators.
Our contributions include deriving closed-form solutions for weight quantization in each layer, significantly improving the representational capabilities of binarized weights.
We show that BWMA achieves notable accuracy improvements over existing methods, registering gains of 1.44%-5.46% and 0.35%-5.37% on respective datasets.
arXiv Detail & Related papers (2025-08-29T11:24:24Z)
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present a novel type of optimizer that carries extremely lightweight state elements, achieved through ultra-low-precision quantization.
The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs.
Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost.
We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
- Exploring Quantization and Mapping Synergy in Hardware-Aware Deep Neural Network Accelerators [0.20971479389679332]
Energy efficiency and memory footprint of a convolutional neural network (CNN) implemented on a CNN inference accelerator depend on many factors.
We show that enabling rich mixed quantization schemes during the implementation can open a previously hidden space of mappings.
CNNs utilizing quantized weights and activations and suitable mappings can significantly improve trade-offs among accuracy, energy, and memory requirements.
arXiv Detail & Related papers (2024-04-08T10:10:30Z)
- Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activations.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
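The overflow guarantee above can be made concrete with a standard worst-case bound (a back-of-the-envelope sketch, not the paper's exact formulation): if training constrains the L1 norm of each output channel's weights, the accumulator width needed for an overflow-free dot product follows directly.

```python
import math

def accumulator_bits(weight_l1, act_bits):
    # Activations are signed `act_bits` integers, so the dot product is
    # bounded by |sum(a_i * w_i)| <= ||w||_1 * 2**(act_bits - 1).
    worst = weight_l1 * 2 ** (act_bits - 1)
    return math.ceil(math.log2(worst + 1)) + 1  # +1 for the sign bit

# e.g. a channel whose int8 weights have L1 norm 900, with int8 inputs:
print(accumulator_bits(900, 8))  # 18 bits suffice instead of a 32-bit default
```

Bounding the weight L1 norm during quantization-aware training is one direct way such a guarantee can be enforced.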
- Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 FPGA can be at least 1.4x more energy efficient than the uniform quantisation version.
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
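Power-of-two (PoT) weights turn every multiplication into a bit shift, which is where the FPGA energy saving comes from. A generic PoT quantizer sketch (illustrative only; parameter names are mine, not the paper's):

```python
import numpy as np

def pot_quantize(w, exp_bits=3):
    # Map each weight to sign(w) * 2**e with an integer exponent, so a
    # hardware multiply becomes a shift by e.
    max_exp = np.floor(np.log2(np.max(np.abs(w))))
    min_exp = max_exp - (2 ** exp_bits - 2)        # reserve one code for zero
    e = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp)
    q = np.sign(w) * 2.0 ** e
    q[np.abs(w) < 2.0 ** (min_exp - 1)] = 0.0      # underflow to zero
    return q

w = np.random.randn(8) * 0.1
print(np.vstack([w, pot_quantize(w)]))
```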
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Overcoming Oscillations in Quantization-Aware Training [18.28657022169428]
When training neural networks with simulated quantization, quantized weights can, rather unexpectedly, oscillate between two grid-points.
We show that these oscillations can lead to significant accuracy degradation due to wrongly estimated batch-normalization statistics.
We propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing.
arXiv Detail & Related papers (2022-03-21T16:07:42Z)
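The oscillation effect described above is easy to reproduce in a toy setting. The sketch below shows the phenomenon only, not the paper's dampening or freezing remedies: when the loss optimum falls between two grid points, straight-through-estimator (STE) updates push the latent weight back and forth across the rounding boundary.

```python
import numpy as np

step, lr = 0.1, 0.5
target = 0.055          # optimum lies between grid points 0.0 and 0.1
w = 0.0                 # latent (shadow) weight

rounded = []
for _ in range(12):
    w_q = np.round(w / step) * step   # simulated quantization
    grad = w_q - target               # gradient of 0.5 * (w_q - target)**2
    w -= lr * grad                    # STE applies it to the latent weight
    rounded.append(w_q)

print(rounded)  # w_q flips between 0.0 and 0.1 instead of settling
```

Dampening penalizes the latent weight's distance to the grid, while iterative freezing stops updating weights that flip too often; both remove exactly this failure mode.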
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using a neural architecture search-like approach.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, requiring external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
- Weight Equalizing Shift Scaler-Coupled Post-training Quantization [0.5936318628878774]
Post-training, layer-wise quantization is preferable because it is free from retraining and is hardware-friendly.
However, accuracy degradation occurs when a neural network model has a large difference in per-output-channel weight ranges.
We propose a new weight equalizing shift scaler, i.e., rescaling the weight range per channel by a 4-bit binary shift prior to layer-wise quantization (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-08-13T09:19:57Z)
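As referenced above, here is a minimal sketch of the per-channel shift equalization as described (hypothetical code; the actual method couples this with a layer-wise quantizer, and the shift must be undone on each channel's output):

```python
import numpy as np

def equalize_channels(w):
    # w: [out_channels, in_features]. Scale each output channel's weights
    # up by a power of two (a 4-bit shift code, 0-15) so all channels span
    # a similar range before a single layer-wise quantization scale is used.
    per_ch = np.max(np.abs(w), axis=1)
    shifts = np.floor(np.log2(per_ch.max() / per_ch)).astype(int)
    shifts = np.clip(shifts, 0, 15)                # 4-bit shift code
    w_eq = w * (2.0 ** shifts)[:, None]            # equalized weights
    return w_eq, shifts                            # outputs need >> shifts

w = np.abs(np.random.randn(4, 16)) * np.array([[1.0], [0.1], [0.01], [0.5]])
w_eq, shifts = equalize_channels(w)
print(shifts, np.max(np.abs(w_eq), axis=1))  # ranges now within a factor of 2
```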