APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores
- URL: http://arxiv.org/abs/2106.12169v1
- Date: Wed, 23 Jun 2021 05:39:34 GMT
- Title: APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores
- Authors: Boyuan Feng, Yuke Wang, Tong Geng, Ang Li, Yufei Ding
- Abstract summary: We introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores.
APNN-TC supports arbitrary short bit-width computation with int1 compute primitives and XOR/AND operations.
It can achieve significant speedup over CUTLASS kernels and various NN models, such as ResNet and VGG.
- Score: 19.516279899089735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the years, accelerating neural networks with quantization has been
widely studied. Unfortunately, prior efforts with diverse precisions (e.g.,
1-bit weights and 2-bit activations) are usually restricted by limited
precision support on GPUs (e.g., int1 and int4). To break such restrictions, we
introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to
fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically,
APNN-TC first incorporates a novel emulation algorithm to support arbitrary
short bit-width computation with int1 compute primitives and XOR/AND Boolean
operations. Second, APNN-TC integrates arbitrary precision layer designs to
efficiently map our emulation algorithm to Tensor Cores with novel batching
strategies and specialized memory organization. Third, APNN-TC embodies a novel
arbitrary precision NN design to minimize memory access across layers and
further improve performance. Extensive evaluations show that APNN-TC can
achieve significant speedup over CUTLASS kernels and various NN models, such as
ResNet and VGG.
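
As a rough illustration of the emulation algorithm described above, the following minimal NumPy sketch (an illustrative toy, not APNN-TC's Tensor Core implementation) shows how an arbitrary short-bit-width unsigned matrix multiply decomposes into 1-bit matrix multiplies, each of which reduces to an AND-plus-popcount per output element, which is the kind of int1 primitive Ampere Tensor Cores expose. Function names are assumptions made for this sketch; only the AND-style path is modeled, not the XOR-based variants mentioned in the abstract, and none of the batching or memory-layout design is shown.

```python
import numpy as np

def bit_planes(x, bits):
    """Decompose an unsigned integer matrix into its binary bit planes."""
    return [(x >> i) & 1 for i in range(bits)]

def arbitrary_precision_matmul(X, W, x_bits, w_bits):
    """Emulate an x_bits-by-w_bits integer matmul with 1-bit matmuls.

    Each pair of bit planes (i, j) contributes 2**(i+j) times the count of
    positions where both bits are 1, which is exactly what a 1-bit
    AND+popcount Tensor Core primitive computes per output element.
    """
    acc = np.zeros((X.shape[0], W.shape[1]), dtype=np.int64)
    for i, Xi in enumerate(bit_planes(X, x_bits)):
        for j, Wj in enumerate(bit_planes(W, w_bits)):
            # Xi @ Wj over {0,1} = sum_k AND(Xi[m,k], Wj[k,n]) = popcount
            acc += (Xi @ Wj) << (i + j)
    return acc

# Quick check against an ordinary integer matmul.
rng = np.random.default_rng(0)
X = rng.integers(0, 2**2, size=(4, 8))   # 2-bit activations
W = rng.integers(0, 2**1, size=(8, 3))   # 1-bit weights
assert np.array_equal(arbitrary_precision_matmul(X, W, 2, 1), X @ W)
```

APNN-TC additionally maps these 1-bit products onto Tensor Core fragments with the batching strategies and specialized memory organization described in the abstract, which this host-side sketch does not model.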
Related papers
- ReActXGB: A Hybrid Binary Convolutional Neural Network Architecture for Improved Performance and Computational Efficiency [0.0]
We propose a hybrid model named ReActXGB, where we replace the fully convolutional layer of ReActNet-A with XGBoost.
This modification aims to narrow the performance gap between BCNNs and real-valued networks while maintaining lower computational costs.
arXiv Detail & Related papers (2024-05-11T16:38:50Z)
- Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z)
- Training Integer-Only Deep Recurrent Neural Networks [3.1829446824051195]
We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN).
Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions.
The proposed method enables RNN-based language models to run on edge devices with a $2\times$ improvement in runtime (a generic fixed-point PWL sketch follows this entry).
arXiv Detail & Related papers (2022-12-22T15:22:36Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and hardware deployment on FPGA validate the great potential of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks (see the sketch after this entry).
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
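
The exact encoding used in the entry above is not spelled out in its summary, so the NumPy sketch below only illustrates the general idea under stated assumptions: split unsigned quantized operands into bit planes, map each plane from {0, 1} to {-1, +1}, and recover the full-precision product from the resulting binary-branch matmuls plus cheap per-row and per-column corrections. The names and the specific mapping are illustrative, not the paper's scheme.

```python
import numpy as np

def to_pm1_planes(x, bits):
    """Split an unsigned integer tensor into {-1, +1} bit planes."""
    return [2 * ((x >> i) & 1) - 1 for i in range(bits)]

def multibranch_binary_matmul(X, W, x_bits, w_bits):
    """Compute X @ W using only {-1, +1} matmuls plus cheap corrections."""
    K = X.shape[1]
    out = np.zeros((X.shape[0], W.shape[1]), dtype=np.int64)
    for i, S in enumerate(to_pm1_planes(X, x_bits)):
        for j, T in enumerate(to_pm1_planes(W, w_bits)):
            # Each (i, j) pair is one binary branch. Expanding
            # ((S+1)/2) @ ((T+1)/2): the S @ T term is the XNOR/popcount-
            # friendly part; the rest are row/column-sum corrections.
            branch = S @ T + S.sum(axis=1, keepdims=True) \
                     + T.sum(axis=0, keepdims=True) + K
            out += (branch // 4) << (i + j)
    return out

rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(5, 16))  # 2-bit inputs
W = rng.integers(0, 4, size=(16, 7))  # 2-bit weights
assert np.array_equal(multibranch_binary_matmul(X, W, 2, 2), X @ W)
```

Only the S @ T term needs a binary matmul kernel; the corrections are rank-1 terms that cost little relative to the matmul itself.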
- Ax-BxP: Approximate Blocked Computation for Precision-Reconfigurable Deep Neural Network Acceleration [3.7371886886933487]
Precision scaling has emerged as a popular technique to optimize the compute and storage requirements of Deep Neural Networks (DNNs).
Efforts toward creating ultra-low-precision (sub-8-bit) DNNs suggest that the minimum precision required to achieve a given network-level accuracy varies considerably across networks.
Previous proposals such as bit-serial hardware incur high overheads, significantly diminishing the benefits of lower precision.
arXiv Detail & Related papers (2020-11-25T20:00:38Z)
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, the computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)