Dynamic Stashing Quantization for Efficient Transformer Training
- URL: http://arxiv.org/abs/2303.05295v1
- Date: Thu, 9 Mar 2023 14:44:31 GMT
- Title: Dynamic Stashing Quantization for Efficient Transformer Training
- Authors: Guo Yang, Daniel Lo, Robert Mullins, Yiren Zhao
- Abstract summary: Large Language Models (LLMs) have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks.
The immense amount of computation and memory access required for LLM training makes it prohibitively expensive in terms of hardware cost.
We propose a novel dynamic quantization strategy, termed Dynamic Stashing Quantization (DSQ), that puts a special focus on reducing the memory operations.
- Score: 4.930533932212726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated impressive performance on a
range of Natural Language Processing (NLP) tasks. Unfortunately, the immense
amount of computations and memory accesses required for LLM training makes them
prohibitively expensive in terms of hardware cost, and thus challenging to
deploy in use cases such as on-device learning. In this paper, motivated by the
observation that LLM training is memory-bound, we propose a novel dynamic
quantization strategy, termed Dynamic Stashing Quantization (DSQ), that puts a
special focus on reducing the memory operations, but also enjoys the other
benefits of low precision training, such as the reduced arithmetic cost. We
conduct a thorough study on two translation tasks (trained-from-scratch) and
three classification tasks (fine-tuning). DSQ reduces the amount of arithmetic
operations by $20.95\times$ and the number of DRAM operations by $2.55\times$
on IWSLT17 compared to the standard 16-bit fixed-point, which is widely used in
on-device learning.
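The savings come from quantizing the intermediate results that are stashed for the backward pass before they are written out to memory. Below is a minimal PyTorch sketch of that general stashing pattern, assuming a simple symmetric fixed-point quantizer and a plain linear layer; the actual quantizer and the dynamic bit-width schedule used by DSQ are defined in the paper, not here.

```python
import torch

def quantize_stash(x: torch.Tensor, bits: int):
    """Symmetric per-tensor fixed-point quantization of a stashed activation."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_stash(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

class StashedLinear(torch.autograd.Function):
    """Linear layer whose input is stashed in low precision for backward."""

    @staticmethod
    def forward(ctx, x, weight, bits):
        q, scale = quantize_stash(x, bits)      # fewer bytes written to DRAM
        ctx.save_for_backward(q, scale, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, weight = ctx.saved_tensors
        x_hat = dequantize_stash(q, scale)      # approximate stashed input
        return grad_out @ weight, grad_out.t() @ x_hat, None
```

A training loop would call `StashedLinear.apply(x, weight, bits)` and vary `bits` across iterations; the schedule DSQ actually uses for that variation is described in the paper.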
Related papers
- Towards Accurate and Efficient Sub-8-Bit Integer Training [24.853958178296587]
Quantization enables low-bitwidth formats in neural network training.
Recent methods have developed new data formats and additional pre-processing operations on quantizers.
It remains quite challenging to achieve high accuracy and efficiency simultaneously.
arXiv Detail & Related papers (2024-11-17T03:32:36Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the LLM's computational cost (by 5.2-6.5x) and GPU memory usage (by 2-6x) without compromising performance.
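As a rough illustration of the early-exit idea only (DeeR's actual exit criterion and architecture are different), the sketch below runs decoder layers one at a time and stops once the representation stops changing, so easy inputs activate only a prefix of the model.

```python
import torch

def early_exit_forward(layers, head, x, threshold=1e-2):
    """Run layers sequentially and exit early when features stabilize."""
    prev = None
    for layer in layers:
        x = layer(x)
        if prev is not None and (x - prev).norm() / prev.norm() < threshold:
            break  # placeholder criterion: representation barely changed
        prev = x
    return head(x)
```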
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4-12.2$\times$ speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8-8.0$\times$ over fp16 models and up to 1.54$\times$ over uniform quantization approaches.
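A hedged sketch of the progressive idea: precision is stepped down as generation proceeds, since later tokens tolerate lower-precision weights. The phase boundaries and the single-step `generate_next` API below are illustrative assumptions, not PMPD's actual schedule or interface.

```python
def pick_precision(step: int) -> int:
    """Illustrative precision schedule: later decoding steps get fewer bits."""
    if step < 64:
        return 8
    if step < 256:
        return 4
    return 3

def progressive_decode(models_by_bits, prompt_ids, max_new_tokens):
    # models_by_bits: hypothetical dict mapping bit-width -> quantized model.
    ids = list(prompt_ids)
    for step in range(max_new_tokens):
        model = models_by_bits[pick_precision(step)]
        ids.append(model.generate_next(ids))  # hypothetical one-token decode call
    return ids
```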
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Gated Slot Attention for Efficient Linear-Time Sequence Modeling [59.019501274074564]
Gated Slot Attention (GSA) enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA).
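For intuition, the sketch below shows a generic gated, bounded-memory recurrence in the GLA style: a fixed-size state is decayed by a learned gate and updated once per token, so memory does not grow with sequence length. This is not GSA's exact formulation.

```python
import torch

def gated_recurrent_attention(q, k, v, gate):
    """q, k, gate: (seq, d_k) with gate values in (0, 1); v: (seq, d_v)."""
    d_k, d_v = k.shape[-1], v.shape[-1]
    state = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(q.shape[0]):
        state = gate[t].unsqueeze(-1) * state + torch.outer(k[t], v[t])
        outputs.append(state.t() @ q[t])   # read out with the current query
    return torch.stack(outputs)            # (seq, d_v)
```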
arXiv Detail & Related papers (2024-09-11T09:49:50Z) - Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
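The memory saving comes from pushing the sequence through the activation-heavy blocks in chunks. A minimal sketch of that chunking pattern (not MsT's actual implementation) is shown below; `chunk_len` is an illustrative choice.

```python
import torch

def chunked_block(x: torch.Tensor, block: torch.nn.Module, chunk_len: int = 1024):
    # x: (batch, seq_len, hidden). Processing one mini-sequence at a time means
    # the block's large intermediate activations exist for only one chunk at once.
    outputs = [block(chunk) for chunk in torch.split(x, chunk_len, dim=1)]
    return torch.cat(outputs, dim=1)
```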
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - EfficientQAT: Efficient Quantization-Aware Training for Large Language Models [50.525259103219256]
Quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss.
We propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm.
EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
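A hedged sketch of that two-phase structure; the optimizers, data iterators, and hyperparameters are illustrative assumptions rather than EfficientQAT's actual recipe.

```python
import torch
import torch.nn.functional as F

def block_ap(fp_blocks, q_blocks, calib_batches, steps=100):
    """Phase 1 (Block-AP): train all parameters of each quantized block to match
    the corresponding full-precision block on calibration data."""
    for fp_blk, q_blk in zip(fp_blocks, q_blocks):
        opt = torch.optim.Adam(q_blk.parameters(), lr=1e-4)
        for _ in range(steps):
            x = next(calib_batches)
            with torch.no_grad():
                target = fp_blk(x)
            loss = F.mse_loss(q_blk(x), target)
            opt.zero_grad(); loss.backward(); opt.step()

def e2e_qp(q_model, quant_params, train_batches, steps=100):
    """Phase 2 (E2E-QP): freeze weights, train only quantization parameters
    (e.g. step sizes) end to end on the task loss."""
    opt = torch.optim.Adam(quant_params, lr=1e-5)
    for _ in range(steps):
        batch = next(train_batches)
        loss = q_model(**batch).loss   # assumes an HF-style model returning .loss
        opt.zero_grad(); loss.backward(); opt.step()
```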
arXiv Detail & Related papers (2024-07-10T17:53:30Z) - ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization [13.622268474310918]
ShiftAddLLM turns pretrained LLMs into efficient multiplication-free models through post-training reparameterization.
It achieves perplexity improvements of 5.6 and 22.7 points at comparable or lower latency.
Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM.
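As a toy illustration of the multiplication-less primitive only (ShiftAddLLM's actual reparameterization is more elaborate), the snippet below approximates each weight by a signed power of two so that a dot product over integer activations needs only shifts and adds.

```python
import math

def to_power_of_two(w: float):
    """Return (sign, exponent) such that w is approximately sign * 2**exponent."""
    if w == 0:
        return 1, None
    exp = round(math.log2(abs(w)))
    return (-1 if w < 0 else 1), exp

def shift_add_dot(int_acts, weights):
    acc = 0
    for a, w in zip(int_acts, weights):
        sign, exp = to_power_of_two(w)
        if exp is None:
            continue
        term = a << exp if exp >= 0 else a >> -exp   # shift replaces multiply
        acc += sign * term
    return acc
```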
arXiv Detail & Related papers (2024-06-10T02:47:55Z) - Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization.
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
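A hedged sketch of one way dual-grained scales can be laid out: fine-grained group-wise INT4 codes plus a coarse per-output-channel scale, with the group scales re-expressed as small integers relative to the coarse one. The group size and scale formats here are assumptions; DGQ's exact scheme may differ.

```python
import torch

def dual_grained_quantize(w: torch.Tensor, group_size: int = 128):
    # w: (out_channels, in_channels); in_channels divisible by group_size.
    out_c, in_c = w.shape
    wg = w.reshape(out_c, in_c // group_size, group_size)
    # Fine grain: per-group INT4 codes.
    group_scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7
    q4 = torch.clamp(torch.round(wg / group_scale), -8, 7).to(torch.int8)
    # Coarse grain: one scale per output channel; group scales become 8-bit
    # integers relative to it, keeping the inner kernel in integer arithmetic.
    channel_scale = group_scale.amax(dim=1, keepdim=True)
    rel_scale = torch.clamp(torch.round(group_scale / channel_scale * 255), 1, 255).to(torch.uint8)
    return q4, rel_scale, channel_scale
```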
arXiv Detail & Related papers (2023-10-07T14:50:28Z) - Hadamard Domain Training with Integers for Class Incremental Quantized Learning [1.4416751609100908]
Continual learning can be cost-prohibitive for resource-constrained edge platforms.
We propose a technique that uses Hadamard transforms to enable low-precision training with only integer matrix multiplications.
We achieve less than 0.5% and 3% accuracy degradation while quantizing all matrix multiplication inputs down to 4 bits with 8-bit accumulators.
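A minimal sketch of a matmul in the Hadamard domain, assuming power-of-two dimensions: because the Hadamard matrix H satisfies H H^T = n I, the product of transformed, quantized operands recovers the original matmul up to a factor of n, and the transform spreads outliers so low-bit quantization loses less. This illustrates the transform trick only, not the paper's full training recipe.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H

def quantize_int(x: torch.Tensor, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale), scale

def hadamard_matmul(x: torch.Tensor, w: torch.Tensor, bits: int = 4):
    # x: (m, n), w: (k, n); returns an approximation of x @ w.T.
    n = x.shape[-1]
    H = hadamard(n)
    xq, sx = quantize_int(x @ H, bits)   # quantize in the Hadamard domain
    wq, sw = quantize_int(w @ H, bits)
    # (xH)(wH)^T = x H H^T w^T = n * x w^T, hence the 1/n rescaling.
    return (xq @ wq.t()) * sx * sw / n
```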
arXiv Detail & Related papers (2023-10-05T16:52:59Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
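Mesa's pattern of exact forward activations plus compressed stashed copies can be prototyped generically with PyTorch's saved-tensor hooks; the naive 8-bit packing below is a sketch of that idea only, not Mesa's implementation, and it indiscriminately compresses every tensor autograd saves (including weights).

```python
import torch

def pack_8bit(t: torch.Tensor):
    """Quantize a tensor saved for backward to int8 plus a scale."""
    if not t.is_floating_point():
        return t
    scale = t.abs().max().clamp(min=1e-8) / 127
    return torch.clamp(torch.round(t / scale), -128, 127).to(torch.int8), scale

def unpack_8bit(packed):
    if isinstance(packed, torch.Tensor):
        return packed
    q, scale = packed
    return q.float() * scale

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU(),
                            torch.nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_8bit, unpack_8bit):
    loss = model(x).sum()      # forward runs in full precision
loss.backward()                # backward sees dequantized stashed tensors
```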
This list is automatically generated from the titles and abstracts of the papers in this site.