Related papers: Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization

Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization

URL: http://arxiv.org/abs/2309.11856v2
Date: Tue, 16 Jan 2024 17:44:40 GMT
Title: Activation Compression of Graph Neural Networks using Block-wise Quantization with Improved Variance Minimization
Authors: Sebastian Eliassen, Raghavendra Selvan
Abstract summary: We present an improvement to the EXACT strategy by using block-wise quantization of the intermediate activation maps. We show further reduction in memory consumption (>15%), and runtime speedup per epoch (about 5%) even when performing extreme extents of quantization.
Score: 0.21756081703275998
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Efficient training of large-scale graph neural networks (GNNs) has been studied with a specific focus on reducing their memory consumption. Work by Liu et al. (2022) proposed extreme activation compression (EXACT) which demonstrated drastic reduction in memory consumption by performing quantization of the intermediate activation maps down to using INT2 precision. They showed little to no reduction in performance while achieving large reductions in GPU memory consumption. In this work, we present an improvement to the EXACT strategy by using block-wise quantization of the intermediate activation maps. We experimentally analyze different block sizes and show further reduction in memory consumption (>15%), and runtime speedup per epoch (about 5%) even when performing extreme extents of quantization with similar performance trade-offs as with the original EXACT. Further, we present a correction to the assumptions on the distribution of intermediate activation maps in EXACT (assumed to be uniform) and show improved variance estimations of the quantization and dequantization steps.

Related papers

Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present a novel type of overload that carries with extremely lightweight state elements, achieved through ultra-low-precision quantization. The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning [0.0]
Sparsity in Deep Neural Networks (DNNs) has gained attention as a solution. This work focuses on ephemeral sparsity, aiming to reduce memory consumption during training. We report the effectiveness of activation pruning by evaluating training speed, accuracy, and memory usage of large-scale neural architectures.
arXiv Detail & Related papers (2023-11-28T15:31:31Z)
Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing. We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency. Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
Learnable Mixed-precision and Dimension Reduction Co-design for Low-storage Activation [9.838135675969026]
Deep convolutional neural networks (CNNs) have achieved many eye-catching results. deploying CNNs on resource-constrained edge devices is constrained by limited memory bandwidth for transmitting large intermediated data during inference. We propose a learnable mixed-precision and dimension reduction co-design system, which separates channels into groups and allocates compression policies according to their importance.
arXiv Detail & Related papers (2022-07-16T12:53:52Z)
BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation. Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration. This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models [9.727062803700264]
We introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication. LUT-GEMM eliminates the resource-intensive dequantization process and reduces computational costs. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially accelerates token generation latency.
arXiv Detail & Related papers (2022-06-20T03:48:17Z)
Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers. Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training. Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
Compression-aware Projection with Greedy Dimension Reduction for Convolutional Neural Network Activations [3.6188659868203388]
We propose a compression-aware projection system to improve the trade-off between classification accuracy and compression ratio. Our test results show that the proposed methods effectively reduce 2.91x5.97x memory access with negligible accuracy drop on MobileNetV2/ResNet18/VGG16.
arXiv Detail & Related papers (2021-10-17T14:02:02Z)
ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training [68.63354877166756]
ActNN is a memory-efficient training framework that stores randomly quantized activations for back propagation. ActNN reduces the memory footprint of the activation by 12x, and it enables training with a 6.6x to 14x larger batch size.
arXiv Detail & Related papers (2021-04-29T05:50:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.