Related papers: EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration

EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration

URL: http://arxiv.org/abs/2506.17615v1
Date: Sat, 21 Jun 2025 06:54:52 GMT
Title: EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
Authors: Ibrahim Ahmed, Clemens Schaefer, Gil Tabak, Denis Vnukov, Zenong Zhang, Felix chern, Anatoliy Yevtushenko, Andy Davis,
Abstract summary: We present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX)<n>By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce.
Score: 3.757632817011334
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce across various network topologies. Furthermore, EQuARX accelerates the prefill stage of Gemma 3 27B by 1.25X and Gemma 3 12B by 1.1X, respectively, with small to negligible impact on quality.

Related papers

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention.<n>On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception.<n>High computational cost limits adoption on resource-constrained platforms.<n>We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception [47.35478308553379]
We introduce textbfQuantV2X, the first fully quantized multi-agent system for efficient deployment of cooperative perception.<n>Despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems.<n>Results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment.
arXiv Detail & Related papers (2025-09-03T20:39:03Z)
Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative [31.74122603714625]
Mixture of Experts (MoE) architecture enhances model capacity through sparse activation.<n>MoE faces two major difficulties in practical deployment.<n>Under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput.<n>This paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU- GPU collaborative inference.
arXiv Detail & Related papers (2025-08-10T12:59:57Z)
MixPE: Quantization and Hardware Co-design for Efficient LLM Inference [16.42907854119748]
MixPE is a specialized mixed-precision processing element designed for efficient low-bit quantization in large language models. We show that MixPE surpasses the state-of-the-art quantization accelerators by $2.6times$ speedup and $1.4times$ energy reduction.
arXiv Detail & Related papers (2024-11-25T07:34:53Z)
ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z)
Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.<n>At batch sizes 32 and quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z)
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs)<n>We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group-wise.<n> Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment. We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees [53.950234267704]
We introduce Global-QSGD, an All-reduce gradient-compatible quantization method.<n>We show that it accelerates distributed training by up to 3.51% over baseline quantization methods.
arXiv Detail & Related papers (2023-05-29T21:32:15Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [29.566132632781848]
We present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
arXiv Detail & Related papers (2022-06-04T00:28:21Z)
PAMS: Quantized Super-Resolution via Parameterized Max Scale [84.55675222525608]
Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR) We propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies the trainable truncated parameter to explore the upper bound of the quantization range adaptively. Experiments demonstrate that the proposed PAMS scheme can well compress and accelerate the existing SR models such as EDSR and RDN.
arXiv Detail & Related papers (2020-11-09T06:16:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.