FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models
for Financial Applications with High-Performance Computing
- URL: http://arxiv.org/abs/2402.13533v1
- Date: Wed, 21 Feb 2024 05:03:17 GMT
- Title: FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models
for Financial Applications with High-Performance Computing
- Authors: Xiao-Yang Liu, Jie Zhang, Guoxuan Wang, Weiqing Tong, and Anwar Walid
- Abstract summary: We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without accuracy drop.
For finetuning, our methods achieve an average accuracy increase of 6.3% and 24.0% on general tasks and financial tasks, respectively.
- Score: 10.47214968497857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are computationally intensive. The computation
workload and the memory footprint grow quadratically with the dimension (layer
width). Most of LLMs' parameters come from the linear layers of the transformer
structure and are highly redundant. These linear layers contribute more than
80% of the computation workload and 99% of the model size. To pretrain and
finetune LLMs efficiently, there are three major challenges to address: 1)
reducing redundancy of the linear layers; 2) reducing GPU memory footprint; 3)
improving GPU utilization when using distributed training. Prior methods, such
as LoRA and QLoRA, utilized low-rank matrices and quantization to reduce the
number of trainable parameters and model size, respectively. However, the
resulting model still consumes a large amount of GPU memory. In this paper, we
present high-performance GPU-based methods that exploit low-rank structures to
pretrain and finetune LLMs for financial applications. We replace one
conventional linear layer of the transformer structure with two narrower linear
layers, which allows us to reduce the number of parameters by several orders of
magnitude. By quantizing the parameters into low precision (8-bit and 4-bit),
the memory consumption of the resulting model is further reduced. Compared with
existing LLMs, our methods achieve a speedup of 1.3X and a model compression
ratio of 2.64X for pretraining without accuracy drop. For finetuning, our
methods achieve an average accuracy increase of 6.3% and 24.0% in general tasks
and financial tasks, respectively, while reducing GPU memory consumption by 6.3X.
The sizes of our models are smaller than 0.59 GB, allowing inference on a
smartphone.
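As a concrete illustration of the factorization described in the abstract, the minimal PyTorch sketch below replaces one conventional d_in x d_out linear layer with two narrower layers that meet at a rank-r bottleneck, then applies stock 8-bit dynamic quantization as a stand-in for the paper's 8-/4-bit schemes. The class name LowRankLinear, the rank of 64, and the quantization API are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Two narrow linear layers standing in for one d_in x d_out layer.

    The parameter count drops from d_in * d_out to rank * (d_in + d_out)
    (plus bias); activations pass through a rank-r bottleneck.
    """

    def __init__(self, d_in: int, d_out: int, rank: int, bias: bool = True):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # d_in -> r
        self.up = nn.Linear(rank, d_out, bias=bias)    # r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


dense = nn.Linear(4096, 4096)                 # ~16.8M parameters
compact = LowRankLinear(4096, 4096, rank=64)  # ~0.53M parameters, roughly 32X fewer
print(sum(p.numel() for p in dense.parameters()),
      sum(p.numel() for p in compact.parameters()))

# Shrink the stored weights further with 8-bit dynamic quantization
# (standard PyTorch API, standing in for the paper's own low-precision scheme).
quantized = torch.quantization.quantize_dynamic(compact, {nn.Linear}, dtype=torch.qint8)
```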
Related papers
- Scalable MatMul-free Language Modeling [8.672867887354977]
We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
arXiv Detail & Related papers (2024-06-04T17:50:34Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer-state memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. A simplified sketch of the gradient-projection idea appears after this list.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
- FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization [6.931020818874328]
We introduce FlattenQuant, a method that significantly reduces the maximum value of a tensor by flattening its large channels, achieving low-bit per-tensor quantization with minimal accuracy loss.
Our work achieves up to 2X speedup and 2.3X memory reduction for LLMs with negligible loss in accuracy.
arXiv Detail & Related papers (2024-02-28T02:00:34Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It achieves, for the first time, high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation, at a 15% speedup.
arXiv Detail & Related papers (2023-06-05T17:53:28Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
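The GaLore entry above names gradient low-rank projection as its memory-saving mechanism. The sketch below is a heavily simplified, hedged illustration of that idea for a single 2-D weight with a plain Adam-style update; the function name, the rank of 64, and the refresh interval are illustrative, and bias correction plus GaLore's scale factor are omitted, so this is not the authors' implementation.

```python
import torch

def galore_style_step(weight, grad, state, rank=64, lr=1e-3,
                      betas=(0.9, 0.999), eps=1e-8, refresh_every=200):
    """Simplified gradient-low-rank-projection update for one 2-D weight.

    Adam-style moments are kept in the rank-r projected space, which is
    where the optimizer-state memory saving comes from.
    """
    step = state.get("step", 0)
    if "proj" not in state or step % refresh_every == 0:
        # Refresh the projection from the gradient's top-r left singular vectors.
        u, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["proj"] = u[:, :rank]                      # shape (m, r)
    p = state["proj"]

    low_rank_grad = p.T @ grad                           # shape (r, n) instead of (m, n)
    m = state.get("m", torch.zeros_like(low_rank_grad))
    v = state.get("v", torch.zeros_like(low_rank_grad))
    m = betas[0] * m + (1 - betas[0]) * low_rank_grad
    v = betas[1] * v + (1 - betas[1]) * low_rank_grad ** 2

    with torch.no_grad():
        weight -= lr * (p @ (m / (v.sqrt() + eps)))      # project back and apply
    state.update(step=step + 1, m=m, v=v)


# Hypothetical usage with a random placeholder gradient:
W = torch.randn(4096, 4096)
state = {}
for _ in range(3):
    G = torch.randn_like(W)
    galore_style_step(W, G, state)
```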