SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost
Computation
- URL: http://arxiv.org/abs/2005.03403v2
- Date: Fri, 8 May 2020 07:35:09 GMT
- Title: SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost
Computation
- Authors: Yang Zhao, Xiaohan Chen, Yue Wang, Chaojian Li, Haoran You, Yonggan
Fu, Yuan Xie, Zhangyang Wang, Yingyan Lin
- Abstract summary: We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all power-of-2.
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
- Score: 97.78417228445883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SmartExchange, an algorithm-hardware co-design framework to trade
higher-cost memory storage/access for lower-cost computation, for
energy-efficient inference of deep neural networks (DNNs). We develop a novel
algorithm to enforce a specially favorable DNN weight structure, where each
layerwise weight matrix can be stored as the product of a small basis matrix
and a large sparse coefficient matrix whose non-zero elements are all
power-of-2. To our best knowledge, this algorithm is the first formulation that
integrates three mainstream model compression ideas: sparsification or pruning,
decomposition, and quantization, into one unified framework. The resulting
sparse and readily-quantized DNN thus enjoys greatly reduced energy consumption
in data movement as well as weight storage. On top of that, we further design a
dedicated accelerator to fully utilize the SmartExchange-enforced weights to
improve both energy efficiency and latency performance. Extensive experiments
show that 1) on the algorithm level, SmartExchange outperforms state-of-the-art
compression techniques, including merely sparsification or pruning,
decomposition, and quantization, in various ablation studies based on nine DNN
models and four datasets; and 2) on the hardware level, the proposed
SmartExchange based accelerator can improve the energy efficiency by up to
6.7$\times$ and the speedup by up to 19.2$\times$ over four state-of-the-art
DNN accelerators, when benchmarked on seven DNN models (including four standard
DNNs, two compact DNN models, and one segmentation model) and three datasets.
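For intuition, here is a minimal sketch of the weight structure the abstract describes: a layer's weights are rebuilt as the product of a large sparse coefficient matrix, whose non-zeros are restricted to signed powers of two, and a small dense basis matrix. The factor order, shapes, sparsity level, and exponent range below are illustrative assumptions rather than the paper's settings, and a real accelerator would implement the power-of-two multiplies as shifts.

```python
import numpy as np

def rebuild_weights(coeffs, basis):
    """Reconstruct a layer's weight matrix W ~= C @ B, where C is a large
    sparse matrix whose non-zeros are signed powers of two and B is a small
    dense basis. Illustrative only; shapes and data are made up."""
    return coeffs @ basis

rng = np.random.default_rng(0)

out_ch, in_dim, rank = 64, 128, 8                                # assumed layer shapes
basis = rng.standard_normal((rank, in_dim)).astype(np.float32)   # small dense basis B

# Sparse coefficient matrix C: ~90% zeros, non-zeros are +/- 2^e.
mask = rng.random((out_ch, rank)) < 0.10
exponents = rng.integers(-3, 2, size=(out_ch, rank))             # e in [-3, 1]
signs = rng.choice([-1.0, 1.0], size=(out_ch, rank))
coeffs = np.where(mask, signs * np.exp2(exponents), 0.0).astype(np.float32)

W = rebuild_weights(coeffs, basis)                                # (64, 128) effective weights
print(W.shape, f"C sparsity: {1 - mask.mean():.0%}")
```

Because every non-zero coefficient is a power of two, the rebuild costs only shifts and adds on the accelerator, which is the cheap computation being traded for reduced weight storage and data movement.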
Related papers
- Hardware-Aware DNN Compression via Diverse Pruning and Mixed-Precision
Quantization [1.0235078178220354]
We propose an automated framework to compress Deep Neural Networks (DNNs) in a hardware-aware manner by jointly employing pruning and quantization.
Our framework achieves a 39% average energy reduction for a 1.7% average accuracy loss across the evaluated datasets and significantly outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2023-12-23T18:50:13Z)
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary
Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and hardware deployment on an FPGA validate the great potential of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
- CREW: Computation Reuse and Efficient Weight Storage for
Hardware-accelerated MLPs and RNNs [1.0635248457021496]
We present CREW, a hardware accelerator that implements computation reuse and an efficient weight storage mechanism.
CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage.
On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator.
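As a hedged reading of what "computation reuse" means here (the dataflow below is an illustrative assumption, not CREW's actual architecture), a quantized fully-connected layer has only a handful of unique weight values, so each input element's product with each unique value can be computed once and then looked up instead of re-multiplied:

```python
import numpy as np

def fc_with_reuse(x, w_indices, unique_weights):
    """Fully-connected layer y = W @ x where W is stored as indices into a
    small table of unique quantized weights. Each product x[j] * unique_w is
    computed once per input element and reused across output neurons."""
    # Precompute all products of each input element with each unique weight.
    products = np.outer(x, unique_weights)                 # (in_dim, n_unique)
    out_dim, in_dim = w_indices.shape
    y = np.zeros(out_dim, dtype=products.dtype)
    for i in range(out_dim):
        # Gather the already-computed products instead of multiplying again.
        y[i] = products[np.arange(in_dim), w_indices[i]].sum()
    return y

rng = np.random.default_rng(1)
unique_weights = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # assumed 4-bit quantized values
w_indices = rng.integers(0, 16, size=(32, 64))                 # weights stored as table indices
x = rng.standard_normal(64).astype(np.float32)

y = fc_with_reuse(x, w_indices, unique_weights)
ref = unique_weights[w_indices] @ x
print(np.allclose(y, ref, atol=1e-5))
```

Multiplications drop from out_dim x in_dim to in_dim x n_unique; the remaining work is indexing and accumulation, which is broadly where the reported speedup and energy savings come from.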
arXiv Detail & Related papers (2021-07-20T11:10:54Z)
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation
Compressed Training [68.63354877166756]
ActNN is a memory-efficient training framework that stores randomly quantized activations for backpropagation.
ActNN reduces the activation memory footprint by 12x and enables training with a 6.6x to 14x larger batch size.
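The sketch below illustrates the general idea of 2-bit activation compressed training: activations are stochastically quantized to four levels before being saved for the backward pass and dequantized when gradients are computed. The per-tensor min/max scaling used here is a simplifying assumption; ActNN's actual quantizer (per-group scaling, mixed precision) differs.

```python
import numpy as np

def quantize_2bit(act, rng):
    """Stochastically round activations onto a 4-level (2-bit) grid."""
    lo, hi = act.min(), act.max()
    scale = (hi - lo) / 3.0 if hi > lo else 1.0       # 4 levels -> 3 steps
    norm = (act - lo) / scale
    q = np.floor(norm + rng.random(act.shape))        # stochastic rounding
    return np.clip(q, 0, 3).astype(np.uint8), lo, scale

def dequantize_2bit(q, lo, scale):
    """Recover an approximate activation tensor for the backward pass."""
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(2)
act = rng.standard_normal((4, 16)).astype(np.float32)  # saved activation (assumed shape)

q, lo, scale = quantize_2bit(act, rng)                  # stored: 2 bits/element + metadata
approx = dequantize_2bit(q, lo, scale)
print("max abs error:", np.abs(approx - act).max())
```

Storing 2 bits per activation instead of 32, plus a little quantization metadata, is roughly where the reported 12x activation-memory reduction comes from.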
arXiv Detail & Related papers (2021-04-29T05:50:54Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, requiring external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
- Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
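A minimal sketch of one common structural-sparsity format follows, assuming a fixed non-zero budget per small block of weights (the paper's exact block format may differ); the fixed per-block budget is what gives the predictable load balancing and negligible index overhead mentioned above.

```python
import numpy as np

def prune_blockwise(w, block=4, keep=1):
    """Keep only the `keep` largest-magnitude weights in every `block`
    consecutive weights along each row (a simple density-bound block format)."""
    out_dim, in_dim = w.shape
    w = w.reshape(out_dim, in_dim // block, block).copy()
    rank = np.argsort(np.abs(w), axis=-1)              # ascending by magnitude
    drop = rank[..., : block - keep]                   # indices to zero out
    np.put_along_axis(w, drop, 0.0, axis=-1)
    return w.reshape(out_dim, in_dim)

rng = np.random.default_rng(3)
w = rng.standard_normal((8, 16)).astype(np.float32)
w_sparse = prune_blockwise(w, block=4, keep=1)         # 75% structured sparsity
print("non-zeros per 4-wide block:", int((w_sparse != 0).sum()) // (8 * 4))
```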
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
- Bit Error Robustness for Energy-Efficient DNN Accelerators [93.58572811484022]
We show that a combination of robust fixed-point quantization, weight clipping, and random bit error training (RandBET) improves robustness against random bit errors.
This leads to high energy savings from both low-voltage operation as well as low-precision quantization.
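Below is a minimal sketch of the random bit error injection such training assumes: weights are quantized to 8-bit fixed point with clipping, and each stored bit is flipped with a small probability to emulate low-voltage memory faults. The quantizer and error model are simplified assumptions, not the paper's exact setup.

```python
import numpy as np

def inject_bit_errors(q_weights, p_flip, rng):
    """Flip each bit of the 8-bit quantized weights independently with
    probability p_flip, emulating low-voltage memory bit errors."""
    bits = np.unpackbits(q_weights[..., None], axis=-1)        # (..., 8) individual bits
    flips = (rng.random(bits.shape) < p_flip).astype(np.uint8)
    return np.packbits(bits ^ flips, axis=-1).squeeze(-1)

rng = np.random.default_rng(4)
w = rng.standard_normal(1024).astype(np.float32)
w = np.clip(w, -1.0, 1.0)                                      # weight clipping
q = np.round((w + 1.0) * 127.5).astype(np.uint8)               # simple 8-bit fixed-point quantization
q_noisy = inject_bit_errors(q, p_flip=0.01, rng=rng)           # ~1% bit error rate
print("weights changed:", int((q != q_noisy).sum()))
```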
arXiv Detail & Related papers (2020-06-24T18:23:10Z)
- PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal
Matrices [35.90103072918056]
The deep neural network (DNN) has emerged as the most important and popular artificial intelligence (AI) technique.
The growth of model size poses a key energy efficiency challenge for the underlying computing platform.
This paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models.
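To make the permuted-diagonal structure concrete, the hedged sketch below builds a weight matrix from p x p blocks whose non-zeros all lie on one circularly shifted diagonal, so each block stores only p values plus a shift and its indexing is implicit; the construction is an illustrative reading of the format rather than the paper's exact definition.

```python
import numpy as np

def permuted_diag_block(values, shift):
    """Build a p x p block whose only non-zeros lie on one circularly
    shifted diagonal: block[i, (i + shift) % p] = values[i]."""
    p = len(values)
    block = np.zeros((p, p), dtype=np.float32)
    block[np.arange(p), (np.arange(p) + shift) % p] = values
    return block

rng = np.random.default_rng(5)
p = 4
# A 2 x 3 grid of p x p permuted-diagonal blocks -> an 8 x 12 structured sparse matrix.
shifts = rng.integers(0, p, size=(2, 3))
blocks = [[permuted_diag_block(rng.standard_normal(p).astype(np.float32), shifts[r, c])
           for c in range(3)] for r in range(2)]
W = np.block(blocks)

print(W.shape, f"density: {(W != 0).mean():.0%}")   # (8, 12), density 1/p = 25%
```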
arXiv Detail & Related papers (2020-04-23T02:26:40Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space.
With the higher accuracy enabled by fine-grained pruning patterns, the key insight is to use the compiler to regain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
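As a hedged illustration of pattern-based kernel pruning (the 4-entry pattern library below is an assumption; PatDNN selects its own patterns), each 3x3 convolution kernel is forced onto whichever pattern preserves the most magnitude, leaving only a few distinct sparsity patterns that a compiler can then specialize code for:

```python
import numpy as np

# A small library of 3x3 masks, each keeping 4 of 9 weights (assumed patterns).
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=np.float32)

def apply_best_pattern(kernel):
    """Zero out a 3x3 kernel with whichever pattern preserves the most magnitude."""
    scores = (np.abs(kernel) * PATTERNS).sum(axis=(1, 2))
    best = int(np.argmax(scores))
    return kernel * PATTERNS[best], best

rng = np.random.default_rng(6)
kernels = rng.standard_normal((8, 3, 3)).astype(np.float32)   # e.g. one conv layer's kernels
pruned = [apply_best_pattern(k) for k in kernels]
print("pattern ids used:", sorted({idx for _, idx in pruned}))
```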