LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- URL: http://arxiv.org/abs/2312.11514v3
- Date: Tue, 30 Jul 2024 23:37:20 GMT
- Title: LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- Authors: Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar,
- Abstract summary: Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks.
This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity.
Our method involves constructing an inference cost model that takes into account the characteristics of flash memory.
- Score: 19.668719251238176
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - $\text{Memory}^3$: Language Modeling with Explicit Memory [22.572376536612015]
We equip large language models (LLMs) with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG)
As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs and RAG models.
We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable.
arXiv Detail & Related papers (2024-07-01T11:07:23Z) - MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory [49.96019697955383]
We introduce MemLLM, a novel method of enhancing knowledge capabilities by integrating a structured and explicit read-and-write memory module.
Our experiments indicate that MemLLM enhances performance and interpretability, in language modeling general and in particular.
We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.
arXiv Detail & Related papers (2024-04-17T18:13:16Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations.
arXiv Detail & Related papers (2024-03-07T08:34:57Z) - Spatial Variation-Aware Read Disturbance Defenses: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions [6.731882555515892]
We present the first rigorous real DRAM chip characterization study of spatial variation of read disturbance.
We propose Sv"ard, a new mechanism that dynamically adapts the aggressiveness of existing solutions based on the row-level read disturbance profile.
arXiv Detail & Related papers (2024-02-28T19:00:55Z) - GLIMMER: generalized late-interaction memory reranker [29.434777627686692]
Memory-augmentation is a powerful approach for incorporating external information into language models.
Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach through 1) exploiting free access to the powerful memory representations by applying a shallow reranker on top of memory to drastically improve retrieval quality at low cost.
arXiv Detail & Related papers (2023-06-17T01:54:25Z) - CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device
Learning [8.339901980070616]
Training AI on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs)
We propose utilizing embedded dynamic random-access memory (eDRAM) as the primary storage medium for transient training data.
We present a highly efficient on-device training engine named textitCAMEL, which leverages eDRAM as the primary on-chip memory.
arXiv Detail & Related papers (2023-05-04T20:57:01Z) - Learning to Rank Graph-based Application Objects on Heterogeneous
Memories [0.0]
This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application's performance.
By performing data placement using our predictive model, we can reduce the execution time degradation by 12% (average) and 30% (max) when compared to the baseline's approach.
arXiv Detail & Related papers (2022-11-04T00:20:31Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.