Dissecting the Runtime Performance of the Training, Fine-tuning, and
Inference of Large Language Models
- URL: http://arxiv.org/abs/2311.03687v2
- Date: Fri, 1 Dec 2023 15:37:07 GMT
- Title: Dissecting the Runtime Performance of the Training, Fine-tuning, and
Inference of Large Language Models
- Authors: Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo
Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu
- Abstract summary: Large Language Models (LLMs) have seen great advance in both academia and industry.
We benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes.
Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs.
- Score: 26.2566707495948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have seen great advance in both academia and
industry, and their popularity results in numerous open-source frameworks and
techniques in accelerating LLM pre-training, fine-tuning, and inference.
Training and deploying LLMs are expensive as it requires considerable computing
resources and memory, hence many efficient approaches have been developed for
improving system pipelines as well as operators. However, the runtime
performance can vary significantly across hardware and software stacks, which
makes it difficult to choose the best configuration. In this work, we aim to
benchmark the performance from both macro and micro perspectives. First, we
benchmark the end-to-end performance of pre-training, fine-tuning, and serving
LLMs in different sizes , i.e., 7, 13, and 70 billion parameters (7B, 13B, and
70B) on three 8-GPU platforms with and without individual optimization
techniques, including ZeRO, quantization, recomputation, FlashAttention. Then,
we dive deeper to provide a detailed runtime analysis of the sub-modules,
including computing and communication operators in LLMs. For end users, our
benchmark and findings help better understand different optimization
techniques, training and inference frameworks, together with hardware platforms
in choosing configurations for deploying LLMs. For researchers, our in-depth
module-wise analyses discover potential opportunities for future work to
further optimize the runtime performance of LLMs.
Related papers
- Inference Performance Optimization for Large Language Models on CPUs [4.7230692120532485]
Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks.
When GPU hardware resources are limited, we can explore alternative options on CPUs.
In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs.
arXiv Detail & Related papers (2024-07-10T01:53:49Z) - Teola: Towards End-to-End Optimization of LLM-based Applications [13.478509565946354]
Large language model (LLM)-based applications contribute to the end-to-end latency.
Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module.
We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph.
arXiv Detail & Related papers (2024-06-29T05:59:53Z) - ORLM: Training Large Language Models for Optimization Modeling [16.348267803499404]
Large Language Models (LLMs) have emerged as powerful tools for tackling complex Operations Research (OR) problem.
To tackle this issue, we propose training open-source LLMs for optimization modeling.
Our best-performing ORLM achieves state-of-the-art performance on the NL4OPT, MAMO, and IndustryOR benchmarks.
arXiv Detail & Related papers (2024-05-28T01:55:35Z) - Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z) - CoLLiE: Collaborative Training of Large Language Models in an Efficient
Way [59.09824823710863]
CoLLiE is an efficient library that facilitates collaborative training of large language models.
With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization.
arXiv Detail & Related papers (2023-12-01T08:02:16Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large
Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z) - Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation
with Large Language Models [12.708117108874083]
Large Language Models (LLMs) generate code snippets given natural language intents in zero-shot, i.e., without the need for specific fine-tuning.
Previous research explored In-Context Learning (ICL) as a strategy to guide the LLM generative process with task-specific prompt examples.
In this paper, we deliver a comprehensive study of.
PEFT techniques for LLMs under the automated code generation scenario.
arXiv Detail & Related papers (2023-08-21T04:31:06Z) - Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.