TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
- URL: http://arxiv.org/abs/2506.01352v1
- Date: Mon, 02 Jun 2025 06:13:41 GMT
- Title: TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network
- Authors: Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kun Yuan, Binhang Yuan
- Abstract summary: We introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers.
- Score: 21.231881562816373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants but faces significant network communication bottlenecks, particularly in pipeline-parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. Existing activation compression methods, such as AQ-SGD, mitigate quantization-induced errors through error compensation but impose prohibitive memory overhead by requiring storage of previous activations. To address these issues, we introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers. We further provide a theoretical analysis, proving that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of $\mathcal{O}(1/\sqrt{T})$, matching that of vanilla stochastic gradient descent. Extensive experiments on diverse LLM tasks demonstrate that TAH-Quant achieves an aggressive activation quantization ratio (3-4 bits), which provides up to 4.3$\times$ end-to-end speedup without compromising training convergence; it matches state-of-the-art methods, incurs no extra memory overhead, and generalizes well across different training scenarios.
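The recipe described in the abstract (partition the activation tensor into tiles, rotate each tile with a Hadamard transform so that outliers are spread across the tile, and pick a per-token bit-width from an entropy score) can be illustrated with a short Python/NumPy sketch. Everything below, including the tile size, the entropy-to-bits rule, and all function names, is an illustrative assumption rather than the authors' implementation, and the pivot-element-swapping step is omitted.

```python
# Minimal sketch of tile-wise Hadamard quantization with entropy-guided
# per-token bit allocation. Tile size, the entropy-to-bit rule, and all
# names are illustrative assumptions, not the TAH-Quant implementation.
import numpy as np

TILE = 64  # assumed tile width; must be a power of two for the Hadamard transform

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def token_bits(x_row: np.ndarray, num_bins: int = 32) -> int:
    """Assumed rule: higher-entropy tokens get more bits (3 or 4)."""
    hist, _ = np.histogram(x_row, bins=num_bins, density=True)
    p = hist / (hist.sum() + 1e-12)
    entropy = -(p * np.log2(p + 1e-12)).sum()
    return 4 if entropy > 3.0 else 3

def quantize_tile(tile: np.ndarray, bits: int):
    """Uniform symmetric quantization of one Hadamard-rotated tile."""
    H = hadamard(tile.shape[1])
    rotated = tile @ H                       # spread outliers across the tile
    scale = np.abs(rotated).max() / (2 ** (bits - 1) - 1) + 1e-12
    q = np.clip(np.round(rotated / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scale

def dequantize_tile(q: np.ndarray, scale: float) -> np.ndarray:
    H = hadamard(q.shape[1])
    return (q.astype(np.float32) * scale) @ H.T  # inverse of the orthonormal rotation

# Toy usage: 8 tokens x one 64-dim tile, each token quantized at its own bit-width.
x = np.random.randn(8, TILE).astype(np.float32)
x_hat = np.vstack([
    dequantize_tile(*quantize_tile(row[None, :], token_bits(row)))
    for row in x
])
print(float(np.abs(x - x_hat).mean()))
```

Because the rotation is orthonormal, dequantization simply applies its transpose; in a pipeline-parallel setup the low-bit integers and scales, rather than the full-precision activations, would be what crosses the slow network link.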
Related papers
- LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation [34.14174796390669]
Post-training quantization (PTQ) is a promising solution to reduce memory usage and accelerate inference. Existing PTQ methods suffer from severe performance degradation under extreme low-bit settings. We propose LRQ-DiT, an efficient and accurate PTQ framework.
arXiv Detail & Related papers (2025-08-05T14:16:11Z)
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using only a minimal set of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- AdaQAT: Adaptive Bit-Width Quantization-Aware Training [0.873811641236639]
Large-scale deep neural networks (DNNs) have achieved remarkable success in many application scenarios.
Model quantization is a common approach to deal with deployment constraints, but searching for optimized bit-widths can be challenging.
We present Adaptive Bit-Width Quantization-Aware Training (AdaQAT), a learning-based method that automatically optimizes bit-widths during training for more efficient inference.
arXiv Detail & Related papers (2024-04-22T09:23:56Z)
- CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a reconstruction scheme with cross-block dependency, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z)
- PIPE: Parallelized Inference Through Post-Training Quantization Ensembling of Residual Expansions [23.1120983784623]
PIPE is a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
It achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width.
arXiv Detail & Related papers (2023-11-27T13:29:34Z)
- Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an all-reduce-compatible gradient quantization method. We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor (a generic sketch of this kind of update quantization appears after this list).
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
- Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on demand to different bit-widths (see the sketch after this list).
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
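The AdaFL entry above concerns adaptive quantization of model updates in federated learning. As a generic illustration of the underlying primitive rather than AdaFL itself, the sketch below applies unbiased stochastic uniform quantization to an update vector at a bit-width that grows with the communication round; the schedule, bit-width range, and names are assumptions made for the example.

```python
# Generic sketch of stochastic uniform quantization of a model update with a
# round-dependent bit-width. The schedule and names are illustrative only;
# they are not taken from the AdaFL paper.
import numpy as np

def stochastic_quantize(update: np.ndarray, bits: int):
    """Unbiased stochastic rounding onto a uniform grid with 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = update.min(), update.max()
    scale = (hi - lo) / levels + 1e-12
    normalized = (update - lo) / scale          # position on the grid
    floor = np.floor(normalized)
    prob_up = normalized - floor                # round up with this probability
    q = floor + (np.random.rand(*update.shape) < prob_up)
    return q.astype(np.uint16), float(lo), float(scale)

def dequantize(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale + lo

def bits_for_round(t: int, total_rounds: int, low: int = 2, high: int = 8) -> int:
    """Assumed schedule: start coarse, add precision as training progresses."""
    return low + int((high - low) * t / max(total_rounds - 1, 1))

# Toy usage: quantize one client's update at round 10 of 100.
update = np.random.randn(1000).astype(np.float32) * 0.01
b = bits_for_round(10, 100)
q, lo, s = stochastic_quantize(update, b)
recovered = dequantize(q, lo, s)
print(b, float(np.abs(update - recovered).mean()))
```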
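The last entry regularizes the $\ell_1$ norm of the loss gradient to make a network robust to post-training quantization. The connection is first-order: for a quantization perturbation $\delta$ with bounded $\|\delta\|_\infty$, the loss change is approximately $\delta^\top \nabla L(w)$, and $|\delta^\top \nabla L(w)| \le \|\delta\|_\infty \|\nabla L(w)\|_1$ by Hölder's inequality, so shrinking the gradient's $\ell_1$ norm limits the first-order damage from bounded quantization noise. A minimal PyTorch sketch of such a penalty follows; the coefficient, model, and training loop are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of an l1 gradient penalty intended to improve robustness to
# post-training quantization. The coefficient lam, the toy model, and the
# training loop are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
lam = 0.05  # assumed regularization strength

for step in range(100):
    x = torch.randn(16, 32)                 # toy batch
    y = torch.randint(0, 10, (16,))
    loss = loss_fn(model(x), y)

    # Gradient of the task loss w.r.t. the weights, kept in the graph so the
    # penalty itself can be differentiated (double backprop).
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    grad_l1 = sum(g.abs().sum() for g in grads)

    # Shrinking ||grad||_1 bounds |delta^T grad| for any bounded quantization
    # perturbation delta, keeping the loss flat under low-bit rounding.
    total = loss + lam * grad_l1

    opt.zero_grad()
    total.backward()
    opt.step()
```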