SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models
- URL: http://arxiv.org/abs/2502.20727v1
- Date: Fri, 28 Feb 2025 05:20:48 GMT
- Title: SPD: Sync-Point Drop for efficient tensor parallelism of Large Language Models
- Authors: Han-Byul Kim, Duc Hoang, Arnav Kundu, Mohammad Samragh, Minsik Cho
- Abstract summary: We introduce Sync-Point Drop (SPD) to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
- Score: 6.065998616707588
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20% overall inference latency reduction with < 1% accuracy regression for LLaMA2-70B inference over 8 GPUs.
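To make the optimization concrete, here is a minimal sketch of where the dropped sync point sits in a Megatron-style tensor-parallel attention block, assuming PyTorch with `torch.distributed`. The `drop_sync` flag and function names are illustrative, not the authors' code, and the sketch omits the redesigned block and the per-block sensitivity analysis the paper uses to decide where dropping is safe.

```python
import torch.distributed as dist

def tp_attention_block(x, attn_shard, out_proj_shard, drop_sync: bool):
    """One tensor-parallel attention block.

    attn_shard:     attention over this rank's subset of heads
    out_proj_shard: this rank's shard of the output projection, yielding
                    a partial sum of the full attention output
    """
    partial = out_proj_shard(attn_shard(x))  # rank-local partial output

    if not drop_sync:
        # Standard tensor parallelism: a sync point (all-reduce) that
        # sums the partial attention outputs across all ranks.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    # Under SPD, blocks judged insensitive to this sync point skip the
    # all-reduce and continue with the rank-local partial result.
    return x + partial  # residual connection
```

In plain tensor parallelism every attention block pays this all-reduce; the paper's contribution is a block design that tolerates the missing reduction plus a per-block policy for deciding which sync points to drop.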
Related papers
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference [14.805702987440512]
We introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference.
Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy; a sketch of the compression idea follows this entry.
arXiv Detail & Related papers (2024-12-06T11:29:32Z)
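As a rough illustration of this entry's idea, the sketch below compresses the tensor-parallel collective to int8 before communicating. The symmetric per-tensor scheme and the all-gather-then-sum structure are assumptions for clarity, not necessarily the paper's exact algorithm.

```python
import torch
import torch.distributed as dist

def quantized_all_reduce(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int8 quantization of the activation payload.
    scale = (x.abs().amax().clamp(min=1e-8) / 127.0).reshape(1)
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)

    # Exchange low-bit payloads plus scales, then sum locally in float;
    # reducing int8 tensors directly would overflow.
    world = dist.get_world_size()
    qs = [torch.empty_like(q) for _ in range(world)]
    ss = [torch.empty_like(scale) for _ in range(world)]
    dist.all_gather(qs, q)
    dist.all_gather(ss, scale)
    return sum(qi.float() * si for qi, si in zip(qs, ss))
```

Each rank then moves roughly a quarter of the fp32 (or half of the fp16) traffic, which is where the reported communication speedup comes from.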
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8-8.0x over fp16 models and up to 1.54x over uniform quantization approaches; a sketch of the precision schedule follows this entry.
arXiv Detail & Related papers (2024-10-17T11:46:33Z)
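A hedged sketch of the scheduling idea behind this entry: precision is lowered progressively as decoding proceeds, on the premise that later tokens tolerate more aggressive quantization. The phase boundaries, bit-widths, and `generate_next` API are illustrative assumptions, not the authors' implementation.

```python
def pick_precision(step: int) -> int:
    """Return the weight bit-width for a given decoding step."""
    if step < 64:       # early tokens: higher precision preserves quality
        return 8
    if step < 256:
        return 4
    return 3            # deep into the sequence: most aggressive setting

def decode(models_by_bits, prompt_ids, max_new_tokens=512):
    """models_by_bits: dict mapping bit-width -> quantized model."""
    ids = list(prompt_ids)
    for step in range(max_new_tokens):
        model = models_by_bits[pick_precision(step)]
        next_id = model.generate_next(ids)  # assumed single-step API
        ids.append(next_id)
    return ids
```

Because decoding is memory-bound, each drop in weight bit-width translates almost directly into faster matrix-vector multiplications, which matches the speedups quoted above.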
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for reducing communication and synchronization costs in distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
- Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference through a seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that distributed inference with HALP achieves a 1.7x inference acceleration for VGG-16.
Model selection combined with HALP-based distributed inference is also shown to significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z)
- Predictive GAN-powered Multi-Objective Optimization for Hybrid Federated Split Learning [56.125720497163684]
We propose a hybrid federated split learning framework in wireless networks.
We design a parallel computing scheme for model splitting without label sharing, and theoretically analyze the influence of the delayed gradient caused by the scheme on the convergence speed.
arXiv Detail & Related papers (2022-09-02T10:29:56Z)
- Receptive Field-based Segmentation for Distributed CNN Inference Acceleration in Collaborative Edge Computing [93.67044879636093]
We study inference acceleration using distributed convolutional neural networks (CNNs) in a collaborative edge computing network.
We propose a novel collaborative edge computing scheme that uses fused-layer parallelization to partition a CNN model into multiple blocks of convolutional layers; a sketch of the receptive-field calculation follows this entry.
arXiv Detail & Related papers (2022-07-22T18:38:11Z)
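The core calculation in this line of work is mapping an output slice back to the input region it depends on, so each device can fetch exactly the receptive field it needs and compute its block of fused layers without further communication. A minimal sketch, with illustrative layer specs:

```python
def input_range(out_lo, out_hi, layers):
    """Map an output index range back to the input indices it needs.

    layers: (kernel, stride, padding) per fused conv layer, ordered
    input -> output; we walk them in reverse. Indices outside the input
    (negative or past the end) fall into padding.
    """
    lo, hi = out_lo, out_hi
    for k, s, p in reversed(layers):
        lo = lo * s - p
        hi = hi * s - p + (k - 1)
    return lo, hi

# Example: split the width of a stack of three 3x3, stride-1 convs
# between two devices.
layers = [(3, 1, 1)] * 3
print(input_range(0, 111, layers))    # device 0 needs inputs (-3, 114)
print(input_range(112, 223, layers))  # device 1 needs inputs (109, 226)
```

The small overlap between the two ranges is the price of fusing layers: each device recomputes a halo of a few columns instead of exchanging intermediate activations.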
- Over-the-Air Federated Learning via Second-Order Optimization [37.594140209854906]
Federated learning (FL) could result in task-oriented data traffic flows over wireless networks with limited radio resources.
We propose a novel over-the-air second-order federated optimization algorithm to simultaneously reduce the communication rounds and enable low-latency global model aggregation.
arXiv Detail & Related papers (2022-03-29T12:39:23Z)
- SPDY: Accurate Pruning with Speedup Guarantees [29.284147465251685]
SPDY is a new compression method that automatically determines layer-wise sparsity targets that achieve a desired inference speedup.
We show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both for one-shot and gradual pruning scenarios.
We also extend our approach to the recently proposed task of pruning with very little data, where we achieve the best known accuracy recovery when pruning to the GPU-supported 2:4 sparsity pattern; a sketch of the layer-wise search follows this entry.
arXiv Detail & Related papers (2022-01-31T10:14:31Z)
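To illustrate what "automatically determines layer-wise sparsity targets" means in practice, here is a hedged sketch of the underlying budget search. SPDY itself pairs learned per-layer sensitivities with a dynamic-programming search; the greedy version below only shows the trade-off being optimized, and the timing/error tables are assumed inputs (e.g., from profiling).

```python
def choose_sparsities(times, errs, target_time):
    """Pick a sparsity level per layer to meet a latency target.

    times[l][s], errs[l][s]: measured latency and accuracy-loss proxy of
    layer l at sparsity level s (levels ordered dense -> most sparse).
    """
    level = [0] * len(times)                      # start fully dense
    total = sum(t[0] for t in times)
    while total > target_time:
        best, best_ratio = None, float("inf")
        for l, s in enumerate(level):
            if s + 1 < len(times[l]):             # layer can be pruned more
                saved = times[l][s] - times[l][s + 1]
                cost = errs[l][s + 1] - errs[l][s]
                if saved > 0 and cost / saved < best_ratio:
                    best, best_ratio = l, cost / saved
        if best is None:
            break                                  # nothing left to prune
        total -= times[best][level[best]] - times[best][level[best] + 1]
        level[best] += 1
    return level
```

Each step prunes wherever the accuracy cost per unit of time saved is lowest, stopping once the latency budget (the desired speedup) is met.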
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
- LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition [33.581285906182075]
We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches.
We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards.
LCP models achieve maximum and average speedups of 56x and 7x over the original models, which can be improved to an average speedup of 33x; a sketch of the branch structure follows this entry.
arXiv Detail & Related papers (2020-03-13T19:52:44Z)
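A minimal sketch of the branch structure this entry describes, assuming PyTorch: several narrow, almost-independent branches that can each run on a separate device, with communication confined to the final merge. The widths, depths, and averaging merge are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NarrowBranch(nn.Module):
    """One narrow branch; runs end-to-end on a single device, no sync."""

    def __init__(self, in_dim=784, width=64, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_classes),
        )

    def forward(self, x):
        return self.body(x)

class LCPModel(nn.Module):
    """Almost-independent branches; the output average is the only
    communication point between devices."""

    def __init__(self, n_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(NarrowBranch() for _ in range(n_branches))

    def forward(self, x):
        return torch.stack([b(x) for b in self.branches]).mean(dim=0)
```

Because each branch is self-contained until the merge, the per-inference communication volume is a single logits vector per branch, which is what enables large speedups on loosely coupled hardware such as Raspberry Pis.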
This list is automatically generated from the titles and abstracts of the papers on this site.