Related papers: SySMOL: Co-designing Algorithms and Hardware for Neural Networks with Heterogeneous Precisions

SySMOL: Co-designing Algorithms and Hardware for Neural Networks with Heterogeneous Precisions

URL: http://arxiv.org/abs/2311.14114v2
Date: Mon, 6 May 2024 19:32:50 GMT
Title: SySMOL: Co-designing Algorithms and Hardware for Neural Networks with Heterogeneous Precisions
Authors: Cyrus Zhou, Pedro Savarese, Vaughn Richard, Zack Hassman, Xin Yuan, Michael Maire, Michael DiBrino, Yanjing Li,
Abstract summary: Recent quantization techniques have enabled heterogeneous precisions at very fine granularity. These networks require additional hardware to decode the precision settings for individual variables, align the variables, and provide fine-grained mixed-precision compute capabilities. We present an end-to-end co-design approach to efficiently execute networks with fine-grained heterogeneous precisions.
Score: 20.241671088121144
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent quantization techniques have enabled heterogeneous precisions at very fine granularity, e.g., each parameter/activation can take on a different precision, resulting in compact neural networks without sacrificing accuracy. However, there is a lack of efficient architectural support for such networks, which require additional hardware to decode the precision settings for individual variables, align the variables, and provide fine-grained mixed-precision compute capabilities. The complexity of these operations introduces high overheads. Thus, the improvements in inference latency/energy of these networks are not commensurate with the compression ratio, and may be inferior to larger quantized networks with uniform precisions. We present an end-to-end co-design approach encompassing computer architecture, training algorithm, and inference optimization to efficiently execute networks with fine-grained heterogeneous precisions. The key to our approach is a novel training algorithm designed to accommodate hardware constraints and inference operation requirements, outputting networks with input-channel-wise heterogeneous precisions and at most three precision levels. Combined with inference optimization techniques, existing architectures with low-cost enhancements can support such networks efficiently, yielding optimized tradeoffs between accuracy, compression ratio and inference latency/energy. We demonstrate the efficacy of our approach across CPU and GPU architectures. For various representative neural networks, our approach achieves >10x improvements in both compression ratio and inference latency, with negligible degradation in accuracy compared to full-precision networks.

Related papers

Complexity-Aware Training of Deep Neural Networks for Optimal Structure Discovery [0.0]
We propose a novel algorithm for combined unit and layer pruning of deep neural networks that functions during training and without requiring a pre-trained network to apply.<n>Our algorithm optimally trades-off learning accuracy and pruning levels while balancing layer vs. unit pruning and computational vs. parameter complexity.<n>We show that the proposed algorithm converges to solutions of the optimization problem corresponding to networks.
arXiv Detail & Related papers (2024-11-14T02:00:22Z)
Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge. Existing methods struggle to balance high model performance with low resource consumption. We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices. For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge [17.277918711842457]
Mixed-precision quantization offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy. This paper proposes a hybrid search methodology to navigate the search space of mixed-precision configurations for a given network. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware optimization to find mixed-precision configurations latency-optimized for a specific hardware target.
arXiv Detail & Related papers (2023-07-06T09:57:48Z)
A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization. We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset. We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
arXiv Detail & Related papers (2023-02-10T17:47:54Z)
Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference [3.3213055774512648]
Quantizing networks to lower precision is a powerful technique for simplifying networks. Mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance. To estimate the impact of layer precision choice on task performance, two methods are introduced. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers.
arXiv Detail & Related papers (2023-01-30T23:26:33Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations. Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks. specially trained CNNs that employ parametrised early exits along their depth to save during inference on easier samples. We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
GradFreeBits: Gradient Free Bit Allocation for Dynamic Low Precision Neural Networks [4.511923587827301]
Quantized neural networks (QNNs) are among the main approaches for deploying deep neural networks on low resource edge devices. We propose GradFreeBits: a novel joint optimization scheme for training dynamic QNNs. Our method achieves better or on par performance with current state of the art low precision neural networks on CIFAR10/100 and ImageNet classification.
arXiv Detail & Related papers (2021-02-18T12:18:09Z)
Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantization neural networks (QNNs) are very attractive to the industry because their extremely cheap calculation and storage overhead, but their performance is still worse than that of networks with full-precision parameters. Most of existing methods aim to enhance performance of QNNs especially binary neural networks by exploiting more effective training techniques. We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.