Scalable MatMul-free Language Modeling
- URL: http://arxiv.org/abs/2406.02528v5
- Date: Tue, 18 Jun 2024 17:30:06 GMT
- Title: Scalable MatMul-free Language Modeling
- Authors: Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
- Abstract summary: We show that MatMul operations can be completely eliminated from large language models.
Our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
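To make the core idea concrete, here is a minimal sketch (not the authors' GPU kernel or the matmulfreellm code; the helper names are made up for illustration) of why ternary weights remove the MatMul: with every weight constrained to {-1, 0, +1}, a dense layer's matrix product reduces to signed additions over the input, so only accumulation is needed.

```python
# Minimal sketch, not the authors' implementation: a ternary-weight linear
# layer in which the usual MatMul becomes additions and subtractions.
import numpy as np

def quantize_ternary(w: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Round weights to {-1, 0, +1} after scaling by their mean magnitude.

    Real ternary layers typically also keep the scale to rescale the output;
    it is omitted here to keep the sketch short.
    """
    scale = np.mean(np.abs(w)) + eps
    return np.clip(np.round(w / scale), -1.0, 1.0)

def ternary_linear(x: np.ndarray, w_ternary: np.ndarray) -> np.ndarray:
    """Compute the layer output with additions/subtractions only.

    x:         (batch, d_in) activations
    w_ternary: (d_out, d_in) weights in {-1, 0, +1}
    """
    out = np.zeros((x.shape[0], w_ternary.shape[0]), dtype=x.dtype)
    for j in range(w_ternary.shape[0]):
        plus = w_ternary[j] == 1    # input columns to add
        minus = w_ternary[j] == -1  # input columns to subtract
        out[:, j] = x[:, plus].sum(axis=1) - x[:, minus].sum(axis=1)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8)).astype(np.float32)
w = quantize_ternary(rng.standard_normal((4, 8)).astype(np.float32))
assert np.allclose(ternary_linear(x, w), x @ w.T)  # same result, no multiplies
```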
Related papers
- CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [17.807249890437767]
We introduce CoLA and its memory-efficient implementation, CoLA-M.
We leverage the low-rank structure observed widely in model activations to reduce model size, boost model capacity and training efficiency.
Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$ and improves training throughput by $1.86\times$ while maintaining full-rank level performance.
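As a rough illustration of why low-rank factorization cuts compute (a generic sketch, not CoLA's exact architecture, which exploits low-rank structure in the activations themselves), replacing a dense d_out x d_in weight with two thin factors reduces parameters and FLOPs from about d_out*d_in to about r*(d_in + d_out):

```python
# Generic low-rank linear layer sketch (illustrative only, not CoLA/CoLA-M).
import numpy as np

def low_rank_linear(x: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """y = (x @ A) @ B with A: (d_in, r) and B: (r, d_out), r << d_in, d_out."""
    return (x @ A) @ B

d_in, d_out, r = 1024, 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((8, d_in))
A = rng.standard_normal((d_in, r)) / np.sqrt(d_in)
B = rng.standard_normal((r, d_out)) / np.sqrt(r)
print(low_rank_linear(x, A, B).shape)  # (8, 1024); cost ~ r*(d_in+d_out) vs d_in*d_out
```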
arXiv Detail & Related papers (2025-02-16T01:05:16Z) - Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers [65.35142508909892]
We present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN.
We demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
arXiv Detail & Related papers (2025-02-12T06:05:52Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient optimizers have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing [10.47214968497857]
We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without an accuracy drop.
For finetuning, our methods achieve an average accuracy increase of 6.3% on general tasks and 24.0% on financial tasks.
arXiv Detail & Related papers (2024-02-21T05:03:17Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse finetuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
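The index-plus-delta bookkeeping described above can be pictured with a short, simplified sketch (not the SpIEL implementation): only the indices of updated parameters and their deltas are stored, and the fine-tuned weights are recovered by scattering the deltas onto the frozen pretrained values.

```python
# Simplified sketch of sparse-delta fine-tuning storage (not SpIEL itself).
import numpy as np

def apply_sparse_deltas(pretrained: np.ndarray,
                        indices: np.ndarray,
                        deltas: np.ndarray) -> np.ndarray:
    """Materialize fine-tuned weights from a frozen vector plus sparse deltas."""
    tuned = pretrained.copy()
    tuned[indices] += deltas  # scatter the stored deltas onto the base weights
    return tuned

pretrained = np.zeros(10, dtype=np.float32)             # frozen base weights
indices = np.array([1, 4, 7])                           # tuned positions
deltas = np.array([0.3, -0.1, 0.05], dtype=np.float32)  # changes vs. pretrained
print(apply_sparse_deltas(pretrained, indices, deltas))
```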
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs [23.381331567339526]
Transformer-based Large Language Models (LLMs) have made a significant impact on various domains.
This paper proposes FlightLLM, enabling efficient LLMs inference with a complete mapping flow on FPGAs.
FlightLLM beats NVIDIA A100 GPU with 1.2$\times$ higher throughput using the latest Versal VHK158 FPGA.
arXiv Detail & Related papers (2024-01-08T13:00:53Z) - Efficient LLM Inference on CPUs [8.802223672775844]
Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks.
However, deploying these models has been challenging due to the astronomical number of model parameters.
We propose an effective approach that makes the deployment of LLMs more efficient.
arXiv Detail & Related papers (2023-11-01T13:08:50Z) - Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
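The fused gradient/parameter update can be sketched with PyTorch's post-accumulate-grad hooks (available since PyTorch 2.1). This is an illustrative approximation of the idea, not the authors' LOMO code: each parameter gets a plain SGD update as soon as its gradient arrives, and the gradient is freed immediately, so gradients for the whole model never coexist in memory.

```python
# Illustrative sketch of fusing the optimizer step into the backward pass
# (assumes PyTorch >= 2.1 for register_post_accumulate_grad_hook).
import torch

def attach_fused_sgd(model: torch.nn.Module, lr: float = 1e-3) -> None:
    """Apply an SGD update inside each parameter's gradient hook, then free the grad."""
    def hook(param: torch.Tensor) -> None:
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # update as soon as the grad is ready
        param.grad = None                      # release the gradient immediately
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

model = torch.nn.Linear(16, 4)
attach_fused_sgd(model, lr=0.1)
loss = model(torch.randn(2, 16)).pow(2).mean()
loss.backward()  # parameters are updated during backward; no .grad tensors remain
```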
arXiv Detail & Related papers (2023-06-16T11:37:15Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
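A simplified absmax Int8 matrix multiplication, sketched below, shows the quantize, integer-accumulate, rescale pattern; LLM.int8() additionally uses vector-wise scaling and a mixed-precision decomposition for outlier features that this sketch omits.

```python
# Simplified absmax Int8 matmul sketch (omits LLM.int8()'s outlier handling).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Per-row absmax quantization: int8 values plus per-row float scales."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0 + 1e-8
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Approximate a @ b: quantize both operands, accumulate in int32, rescale."""
    qa, sa = quantize_int8(a)    # scales over rows of a
    qb, sb = quantize_int8(b.T)  # scales over columns of b
    acc = qa.astype(np.int32) @ qb.T.astype(np.int32)
    return acc * sa * sb.T       # undo both scales

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))  # small quantization error
```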