GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy
Efficient Inference
- URL: http://arxiv.org/abs/2005.03842v2
- Date: Sun, 27 Sep 2020 00:09:30 GMT
- Title: GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy
Efficient Inference
- Authors: Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, and Andreas Moshovos
- Abstract summary: We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models.
Unlike other quantization methods, GOBO requires neither fine-tuning nor retraining to compensate for the quantization error.
The GOBO architecture maintains most of the weights in 3b even during computation.
- Score: 1.6534387701595552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based models have demonstrated remarkable success in various
natural language understanding tasks. However, efficient execution remains a
challenge for these models, which are memory-bound due to their massive number
of parameters. We present GOBO, a model quantization technique that compresses
the vast majority (typically 99.9%) of the 32-bit floating-point parameters of
state-of-the-art BERT models and their variants to 3 bits while maintaining
their accuracy. Unlike other quantization methods, GOBO requires neither
fine-tuning nor retraining to compensate for the quantization error. We present
two practical hardware applications of GOBO. In the first, GOBO reduces memory
storage and traffic, and as a result inference latency and energy consumption.
This GOBO memory compression mechanism is plug-in compatible with many
architectures; we demonstrate it with the TPU, Eyeriss, and an architecture
using Tensor Core-like units. Second, we present a co-designed hardware
architecture that also reduces computation. Uniquely, the GOBO architecture
maintains most of the weights in 3b even during computation, a property that:
(1) makes the processing elements area-efficient, allowing us to pack more
compute power per unit area, (2) replaces most multiply-accumulations with
additions, and (3) reduces the off-chip traffic by amplifying on-chip memory
capacity.
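To make the mechanism concrete, here is a minimal Python/NumPy sketch of the two ideas in the abstract. It is not a reproduction of the paper's exact algorithm: the z-score outlier cutoff, the quantile-based centroid initialization, the number of refinement steps, and the helper names gobo_quantize/gobo_dot are illustrative assumptions. The sketch splits a weight vector into a dominant "Gaussian" group encoded as 3-bit indices into eight shared centroids plus a small FP32 outlier group, and shows how a dot product can sum activations per centroid bin first, deferring nearly all multiplications to a final eight-element step.

```python
# Hedged sketch of GOBO-style dictionary quantization for a single weight vector.
# Assumptions (not from the paper): z-score outlier cutoff, quantile centroid
# initialization, and a fixed number of k-means-style refinement steps.
import numpy as np

def gobo_quantize(weights, n_bits=3, outlier_z=4.0, iters=10):
    w = np.asarray(weights, dtype=np.float32).ravel()
    mean, std = w.mean(), w.std()
    outlier_mask = np.abs(w - mean) > outlier_z * std   # the tiny fraction kept in FP32
    bulk = w[~outlier_mask]                             # the "Gaussian" group

    # Fit 2^n_bits shared centroids to the bulk of the distribution.
    k = 2 ** n_bits
    centroids = np.quantile(bulk, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        idx = np.abs(bulk[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = bulk[idx == c]
            if members.size:
                centroids[c] = members.mean()

    indices = np.abs(bulk[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, indices.astype(np.uint8), outlier_mask, w[outlier_mask]

def gobo_dot(activations, centroids, indices, outlier_mask, outliers):
    """Dot product that keeps the bulk weights in 3-bit form during compute:
    activations are summed per centroid bin first (additions only), then the
    eight bin sums are multiplied by the centroids; outliers stay in FP32."""
    a = np.asarray(activations, dtype=np.float32).ravel()
    bin_sums = np.bincount(indices, weights=a[~outlier_mask],
                           minlength=len(centroids))
    return float(bin_sums @ centroids) + float(a[outlier_mask] @ outliers)
```

Because the quantized weights take only 2^3 = 8 distinct values, this accumulate-then-multiply ordering replaces most multiply-accumulations with additions, which is the property the abstract credits for area-efficient processing elements and reduced off-chip traffic.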
Related papers
- SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment [5.141876811512978]
SmallThinker is a family of large language models (LLMs) designed for local devices.
We introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks.
We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs.
arXiv Detail & Related papers (2025-07-28T16:45:14Z)
- CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs [45.77132019859689]
CalibQuant is a visual quantization strategy that drastically reduces both memory and computational overhead.
We achieve a 10x throughput increase on InternVL models.
arXiv Detail & Related papers (2025-02-15T05:08:01Z)
- Neuromorphic Principles for Efficient Large Language Models on Intel Loihi 2 [5.213433310722838]
Large language models (LLMs) deliver impressive performance but require large amounts of energy.
We present a MatMul-free LLM architecture adapted for Intel's neuromorphic processor, Loihi 2.
Our approach leverages Loihi 2's support for low-precision, event-driven computation and stateful processing.
arXiv Detail & Related papers (2025-02-12T02:40:44Z)
- Less Memory Means Smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
arXiv Detail & Related papers (2024-09-18T11:57:05Z)
- Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers.
We identify three key challenges: (C1) Base function, (C2) Inefficiency, and (C3) Weight initialization.
With these designs, KAT outperforms traditional MLP-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- Simple linear attention language models balance the recall-throughput tradeoff [60.06020449520365]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z)
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly-efficient large generative model inference on high-performance Tensor Cores.
At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - A Co-design view of Compute in-Memory with Non-Volatile Elements for
Neural Networks [12.042322495445196]
We discuss how compute-in-memory can play an important part in the next generation of computing hardware.
A non-volatile memory based cross-bar architecture forms the heart of an engine that uses an analog process to parallelize the matrix vector multiplication operation.
The cross-bar architecture, at times referred to as a neuromorphic approach, can be a key hardware element in future computing machines.
arXiv Detail & Related papers (2022-06-03T15:59:46Z)
- PoET-BiN: Power Efficient Tiny Binary Neurons [1.7274221736253095]
We propose PoET-BiN, a Look-Up Table based, power-efficient implementation on resource-constrained embedded devices.
A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain.
A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses.
arXiv Detail & Related papers (2020-02-23T00:32:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.