ISO: Overlap of Computation and Communication within Sequence for LLM Inference
- URL: http://arxiv.org/abs/2409.11155v1
- Date: Wed, 4 Sep 2024 05:22:17 GMT
- Title: ISO: Overlap of Computation and Communication within Sequence for LLM Inference
- Authors: Bin Xiao, Lei Su
- Abstract summary: This paper introduces a novel strategy for computation-communication overlap that operates at the sequence level.
Experimental evaluations conducted using 30B/70B models have demonstrated significant improvements in efficiency.
- Score: 8.616769297336708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This method not only enhances the degree of overlap but also minimizes the constraints on its applicability. Experimental evaluations conducted using 30B/70B models have demonstrated significant improvements in efficiency. Specifically, the proposed technique has been shown to reduce time consumption by approximately 35% on the 4090 GPU and by roughly 15% on the A800 GPU during the prefill stage of LLM inference.
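The paper's implementation is not reproduced here; as a rough illustration of the sequence-level overlap idea, the sketch below splits a prefill sequence into chunks and launches an asynchronous all-reduce for chunk i while chunk i+1 is computed. It assumes an initialized PyTorch tensor-parallel process group, and `chunk_forward` is a hypothetical stand-in for the per-rank partial computation.

```python
# Minimal sketch, not the paper's implementation: overlap the tensor-parallel
# all-reduce of chunk i with the computation of chunk i+1 during prefill.
# Assumes torch.distributed is initialized (e.g. with the NCCL backend);
# `chunk_forward` is a hypothetical stand-in for a layer's partial matmul.
import torch
import torch.distributed as dist

def chunk_forward(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Placeholder for the per-rank partial computation (e.g. an MLP shard).
    return x @ weight

def overlapped_prefill(seq: torch.Tensor, weight: torch.Tensor,
                       num_chunks: int = 4) -> torch.Tensor:
    handles, outputs = [], []
    for chunk in seq.chunk(num_chunks, dim=0):       # split along sequence
        out = chunk_forward(chunk, weight)           # compute this chunk
        # Launch the all-reduce asynchronously; NCCL runs it on its own
        # stream, so the next iteration's matmul overlaps this transfer.
        handles.append(dist.all_reduce(out, async_op=True))
        outputs.append(out)
    for h in handles:
        h.wait()                                     # drain in-flight comms
    return torch.cat(outputs, dim=0)
```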
Related papers
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE.
Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations [8.881243419237608]
We propose three key innovations for efficient interactive long context inference.
These are adaptive chunking to reduce prefill overheads in mixed batching (sketched below), Sequence Pipeline Parallelism (SPP), and KV Cache Parallelism (KVP).
These contributions are combined into a 3D strategy, enabling Mnemosyne to scale interactive inference to context lengths of at least 10 million tokens with high throughput enabled by parallelism.
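As a hypothetical illustration of the adaptive-chunking idea above (not Mnemosyne's actual scheduler), a long prompt's prefill can be split into size-capped chunks so decode steps of other requests can be interleaved; the `budget` parameter is an assumed name.

```python
# Hypothetical sketch of chunked prefill for mixed batching; names such
# as `budget` are illustrative, not Mnemosyne's API. An adaptive variant
# would vary the budget with current load rather than keeping it fixed.
def prefill_chunks(prompt_len: int, budget: int):
    """Yield (start, end) token ranges no larger than `budget`."""
    start = 0
    while start < prompt_len:
        end = min(start + budget, prompt_len)
        yield start, end
        start = end

# A 10,000-token prompt with a 2,048-token compute budget becomes five
# prefill chunks, between which decode iterations can be scheduled.
print(list(prefill_chunks(10_000, 2_048)))
```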
arXiv Detail & Related papers (2024-09-25T18:21:05Z)
- Geometric Clustering for Hardware-Efficient Implementation of Chromatic Dispersion Compensation [2.8870882078316855]
This paper provides a theoretical analysis of the tap overlapping effect in CDC filters for coherent receivers.
We introduce a novel Time-Domain Clustered Equalizer (TDCE) technique based on this concept.
We develop an innovative parallelization method for TDCE, implementing it in hardware for fiber lengths up to 640 km.
arXiv Detail & Related papers (2024-09-16T15:48:05Z)
- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z)
- ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
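The following is a generic sketch of hiding gradient communication behind computation during accumulation, not ACCO's delay-compensation scheme itself; it assumes an initialized `torch.distributed` process group and, for brevity, a single parameter tensor and a hypothetical `loss_fn` closure.

```python
# Generic overlap pattern, not the ACCO algorithm: each micro-batch's
# gradient all-reduce is launched asynchronously so it runs while the
# next micro-batch's forward/backward executes. Assumes torch.distributed
# is initialized; `loss_fn(x, y, param)` is a hypothetical closure.
import torch
import torch.distributed as dist

def accumulate_with_overlap(param: torch.Tensor, micro_batches, loss_fn):
    pending = []                               # (work handle, grad tensor)
    for x, y in micro_batches:
        loss = loss_fn(x, y, param)
        (grad,) = torch.autograd.grad(loss, [param])
        # Start reducing this gradient across workers; the next loop
        # iteration's compute overlaps the in-flight communication.
        pending.append((dist.all_reduce(grad, async_op=True), grad))
    total = torch.zeros_like(param)
    for work, grad in pending:
        work.wait()                            # grad now summed across ranks
        total += grad
    return total   # scale by world size / #micro-batches as appropriate
```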
arXiv Detail & Related papers (2024-06-03T08:23:45Z)
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts [4.629608387540524]
We present a novel shortcut-connected MoE (ScMoE) architecture with an overlapping parallel strategy.
ScMoE allows 70% to 100% of communication to be overlapped with computation.
Building on the ScMoE architecture, we further implement an expert offloading strategy to facilitate memory-limited inference.
arXiv Detail & Related papers (2024-04-07T17:17:23Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization [159.75564904944707]
We propose an asynchronous quasi-Newton (AsySQN) framework for vertical federated learning (VFL).
The proposed algorithms make descent steps scaled by approximate Hessian information, without calculating the inverse Hessian matrix explicitly (see the sketch below).
We show that the adopted asynchronous computation can make better use of the computation resource.
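For intuition only (this is not the AsySQN update itself): a descent step can be scaled by curvature estimated from successive iterates, for instance with a Barzilai-Borwein scalar, so no inverse Hessian is ever formed.

```python
# Illustrative only: a curvature-scaled descent step with no explicit
# inverse Hessian. The Barzilai-Borwein scalar stands in for the
# approximate Hessian scaling described above; it is not AsySQN.
import numpy as np

def bb_scaled_step(w, grad, w_prev, grad_prev, eps=1e-12):
    s = w - w_prev                     # iterate difference
    y = grad - grad_prev               # gradient difference
    gamma = (s @ y) / (y @ y + eps)    # scalar approximation of 1 / curvature
    return w - gamma * grad            # scaled descent step

# Example: one step on f(w) = 0.5 * ||w||^2, whose gradient is w itself
# and whose curvature is 1, so the scaled step lands exactly at the minimum.
w_prev, w = np.array([2.0, 2.0]), np.array([1.0, 1.5])
print(bb_scaled_step(w, w, w_prev, w_prev))  # -> [0. 0.]
```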
arXiv Detail & Related papers (2021-09-26T07:56:10Z)