ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a
Single GPU
- URL: http://arxiv.org/abs/2312.02515v1
- Date: Tue, 5 Dec 2023 05:38:38 GMT
- Title: ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a
Single GPU
- Authors: Zhengmao Ye and Dengchun Li and Jingqi Tian and Tingfeng Lan and Jie
Zuo and Lei Duan and Hui Lu and Yexi Jiang and Jian Sha and Ke Zhang and
Mingjie Tang
- Abstract summary: We present ASPEN, a framework for fine-tuning transformer-based large language models (LLMs)
ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging shared pre-trained model and adaptive scheduling.
Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on NVIDIA A100 80GB GPU.
- Score: 4.198627205271621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based large language models (LLMs) have demonstrated outstanding
performance across diverse domains, particularly when fine-turned for specific
domains. Recent studies suggest that the resources required for fine-tuning
LLMs can be economized through parameter-efficient methods such as Low-Rank
Adaptation (LoRA). While LoRA effectively reduces computational burdens and
resource demands, it currently supports only a single-job fine-tuning setup.
In this paper, we present ASPEN, a high-throughput framework for fine-tuning
LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA
method, leveraging shared pre-trained model and adaptive scheduling. ASPEN is
compatible with transformer-based language models like LLaMA and ChatGLM, etc.
Experiments show that ASPEN saves 53% of GPU memory when training multiple
LLaMA-7B models on NVIDIA A100 80GB GPU and boosts training throughput by about
17% compared to existing methods when training with various pre-trained models
on different GPUs. The adaptive scheduling algorithm reduces turnaround time by
24%, end-to-end training latency by 12%, prioritizing jobs and preventing
out-of-memory issues.
Related papers
- FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model [48.33280660752336]
Large language models (LLMs) show amazing performance on many domain-specific tasks after fine-tuning with some appropriate data.
Many domain-specific data are privately distributed across multiple owners.
We introduce FedBiOT, a resource-efficient LLM fine-tuning approach to federated learning.
arXiv Detail & Related papers (2024-06-25T16:45:47Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z) - MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts [3.6301530893494127]
MixLoRA is an approach to construct a resource-efficient sparse MoE model based on LoRA.
Our evaluations show that MixLoRA improves about 9% accuracy compared to state-of-the-art PEFT methods in multi-task learning scenarios.
arXiv Detail & Related papers (2024-04-22T02:15:52Z) - LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning [31.088229461632206]
Large language models (LLMs) have become a significant roadblock to large-scale training.
Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem.
We investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms.
We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative for LoRA.
arXiv Detail & Related papers (2024-03-26T17:55:02Z) - JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management.
Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPU while consuming less than half the VRAM per GPU.
arXiv Detail & Related papers (2024-03-17T23:02:04Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z) - Chain of LoRA: Efficient Fine-tuning of Language Models via Residual
Learning [31.036465632204663]
We introduce Chain of LoRA, an iterative optimization framework inspired by the Frank-Wolfe algorithm.
We demonstrate that COLA can consistently outperform LoRA without additional computational or memory costs.
arXiv Detail & Related papers (2024-01-08T14:26:49Z) - Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs)
Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z) - Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z) - LoRA: Low-Rank Adaptation of Large Language Models [71.75808607987281]
Low-Rank Adaptation, or LoRA, freezes the pre-trained model weights and injects trainable rank decomposition into each layer of the Transformer architecture.
For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning.
arXiv Detail & Related papers (2021-06-17T17:37:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.