BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based
Quantized DNNs
- URL: http://arxiv.org/abs/2005.09904v2
- Date: Mon, 31 Aug 2020 05:43:28 GMT
- Title: BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based
Quantized DNNs
- Authors: Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin
Yun, and Dongsoo Lee
- Abstract summary: The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy.
We propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs.
- Score: 7.635154697466773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The number of parameters in deep neural networks (DNNs) is rapidly increasing
to support complicated tasks and to improve model accuracy. Correspondingly,
the amount of computations and required memory footprint increase as well.
Quantization is an efficient method to address such concerns by compressing
DNNs such that computations can be simplified while required storage footprint
is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully
support quantization because only fixed data transfers (such as 32 bits) are
allowed. As a result, even if weights are quantized into a few bits, CPUs and
GPUs cannot access multiple quantized weights without memory bandwidth waste.
The success of quantization in practice hence relies on an efficient computation
engine, especially for matrix multiplication, which is the basic computational
kernel in most DNNs. In this paper, we propose a novel matrix multiplication
method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access
multiple quantized weights simultaneously in one instruction. In addition,
BiQGEMM pre-computes intermediate results that become highly redundant when
quantization restricts weights to a small set of possible values. Since the
pre-computed values are stored in lookup tables and reused, BiQGEMM reduces the
overall amount of computation. Our extensive experimental results show that
BiQGEMM delivers higher performance than conventional schemes when DNNs are quantized.
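To make the lookup-table idea concrete, below is a minimal NumPy sketch of a LUT-based matrix-vector product for binary-coding-quantized weights, i.e. W approximated as sum_i alpha_i * B_i with B_i in {-1, +1}. It is an illustration inferred from the abstract, not the paper's actual kernel: the function name, the code packing (bit j of codes[r, g] stores the sign of the weight connecting output r to input g*mu + j), and the sub-vector width mu = 8 are assumptions chosen for readability.

```python
import numpy as np

def lut_gemv_binary_coding(x, bin_codes, alphas, mu=8):
    """Lookup-table GEMV for weights quantized as W ~= sum_i alphas[i] * B_i, B_i in {-1, +1}.

    bin_codes[i] has shape (out_dim, n_groups); bit j of bin_codes[i][r, g] encodes
    the sign (1 -> +1, 0 -> -1) of B_i[r, g*mu + j]. The packing is illustrative.
    """
    out_dim, n_groups = bin_codes[0].shape
    assert x.shape[0] == n_groups * mu

    # Pre-compute, for every group of mu consecutive inputs, all 2**mu possible
    # signed partial sums. These tables depend only on x, so they are shared by
    # every output row and every bit-plane.
    keys = np.arange(1 << mu)
    signs = np.where((keys[:, None] >> np.arange(mu)) & 1, 1.0, -1.0)  # (2**mu, mu)
    tables = signs @ x.reshape(n_groups, mu).T                          # (2**mu, n_groups)

    # Each output row then just gathers its pre-computed partial sums:
    # multiplications are replaced by table lookups and additions.
    y = np.zeros(out_dim)
    for alpha, codes in zip(alphas, bin_codes):            # one bit-plane at a time
        partials = tables[codes, np.arange(n_groups)]      # (out_dim, n_groups) gather
        y += alpha * partials.sum(axis=1)
    return y
```

The point the sketch tries to capture is that each table of 2**mu partial sums is built once per input sub-vector and then reused by every output row and every bit-plane, which is where the redundancy argument in the abstract comes from; a real kernel would additionally pack the codes so that a single load fetches many quantized weights at once.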
Related papers
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
- NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search [7.971065005161565]
Quantization is a technique to convert floating-point representations to low bit-width fixed-point representations.
We show how to learn new quantized weights over the entire quantized space.
We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations.
arXiv Detail & Related papers (2023-08-10T14:19:58Z)
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks [20.473852621915956]
We propose a uniform quantization scheme that efficiently compresses weights and membrane potentials in spiking neural networks (SNNs).
MINT quantizes membrane potentials to an extremely low precision (2-bit), significantly reducing the memory footprint.
Experimental results show that our method matches the accuracy of full-precision models and other state-of-the-art SNN quantization techniques.
arXiv Detail & Related papers (2023-05-16T23:38:35Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Fast matrix multiplication for binary and ternary CNNs on ARM CPU [0.9135092203041721]
We propose fast algorithms of ternary, ternary-binary, and binary matrix multiplication for mobile devices with ARM architecture.
Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs.
We evaluate them experimentally on ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications.
arXiv Detail & Related papers (2022-05-18T14:52:34Z)
- Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using alternating direction methods of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs.
arXiv Detail & Related papers (2021-11-29T09:30:06Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the integer-programming loss.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- CREW: Computation Reuse and Efficient Weight Storage for Hardware-accelerated MLPs and RNNs [1.0635248457021496]
We present CREW, a hardware accelerator that implements Computation Reuse and an Efficient Weight Storage mechanism.
CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage.
On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator.
arXiv Detail & Related papers (2021-07-20T11:10:54Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks; a sketch of this style of decomposition appears after this list.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features of the original full-precision networks onto high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
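For context, "binary-coding-based quantization" (as in the BiQGEMM title and the {-1, +1} encoding-decomposition paper above) refers to approximating a weight matrix as a sum of scaled binary matrices, W ~= sum_i alpha_i * B_i. The sketch below uses greedy residual binarization, one common way to build such a decomposition; it is illustrative only, since BiQGEMM itself does not prescribe how the alpha_i and B_i are obtained.

```python
import numpy as np

def greedy_binary_coding(W, num_bits=3):
    """Approximate W as sum_i alphas[i] * bases[i] with bases[i] in {-1, +1}.

    Greedy residual fitting: each step binarizes what the previous steps missed.
    This is one standard construction, shown only to illustrate the weight format
    that LUT-based kernels such as the sketch above consume.
    """
    residual = np.asarray(W, dtype=np.float64).copy()
    alphas, bases = [], []
    for _ in range(num_bits):
        B = np.where(residual >= 0, 1.0, -1.0)   # binary base for this bit-plane
        alpha = np.abs(residual).mean()          # least-squares scale for a sign matrix
        alphas.append(alpha)
        bases.append(B)
        residual -= alpha * B                    # error passed on to the next bit-plane
    return alphas, bases

# Example: a 3-bit decomposition of a random weight matrix.
W = np.random.randn(4, 8)
alphas, bases = greedy_binary_coding(W, num_bits=3)
W_hat = sum(a * B for a, B in zip(alphas, bases))
print(np.abs(W - W_hat).mean())  # reconstruction error shrinks as num_bits grows
```

With num_bits = 1 this reduces to classic binary weights; increasing num_bits trades a little extra storage and lookup work for reconstruction accuracy.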