LoMA: Lossless Compressed Memory Attention
- URL: http://arxiv.org/abs/2401.09486v2
- Date: Sun, 4 Feb 2024 03:14:08 GMT
- Title: LoMA: Lossless Compressed Memory Attention
- Authors: Yumeng Wang, Zhenyang Xiao
- Abstract summary: Lossless Compressed Memory Attention (LoMA) is a novel approach to reduce memory and computational demands during autoregressive generation.
LoMA incorporates a specialized training or fine-tuning precedure alongside an autoregressive generation algorithm optimized for the compressed context.
Experimental validation has demonstrated that LoMA significantly reducing computational consumption and memory usage.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) face limitations due to the high demand on GPU
memory and computational resources when handling long contexts. While sparsify
the Key-Value (KV) cache of transformer model is a typical strategy to
alleviate resource usage, it unavoidably results in the loss of information. We
introduce Lossless Compressed Memory Attention (LoMA), a novel approach that
enables lossless compression of the KV cache, thereby reducing the memory and
computational demands during autoregressive generation. LoMA incorporates a
specialized training or fine-tuning precedure alongside an autoregressive
generation algorithm optimized for the compressed context. Our method
compresses the KV cache after every $tc$ generated tokens with a compression
ratio of $c$ and a target compressed length $t$, and this process occurs within
a single inference pass without dependency on auxiliary models. We engineered
an efficient training scheme involving specific inputs, attention masks, and
position identifiers to instill this compression capability. Experimental
validation has demonstrated that LoMA significantly reducing computational
consumption and memory usage through achieving lossless KV cache compression.
Related papers
- HERA: High-efficiency Matrix Compression via Element Replacement [5.858734684979008]
Large Language Models (LLMs) have advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis.
Their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment.
We propose HERA, a novel algorithm that employs Element Replacement for compressing matrix.
arXiv Detail & Related papers (2024-07-04T05:13:58Z) - Effectively Compress KV Heads for LLM [28.0801697946958]
We propose a novel approach for compressing Key-Value ( KV) caches.
Our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs.
arXiv Detail & Related papers (2024-06-11T08:37:33Z) - MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
Key-Value ( KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z) - ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [19.985314022860432]
KV cache stores key and value states from previous tokens to avoid re-computation.
KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance.
We present ZipCache, an accurate and efficient KV cache quantization method for LLMs.
arXiv Detail & Related papers (2024-05-23T07:37:16Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [57.53291046180288]
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference.
We propose PyramidInfer, a method that compresses the KV cache by layer-wise retaining crucial context.
PyramidInfer improves 2.2x throughput compared to Accelerate with over 54% GPU memory reduction in KV cache.
arXiv Detail & Related papers (2024-05-21T06:46:37Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference [1.9639467358416092]
Transformers have emerged as the backbone of large language models (LLMs)
We propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
arXiv Detail & Related papers (2024-03-14T17:59:26Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - Neural Network Compression for Noisy Storage Devices [71.4102472611862]
Conventionally, model compression and physical storage are decoupled.
This approach forces the storage to treat each bit of the compressed model equally, and to dedicate the same amount of resources to each bit.
We propose a radically different approach that: (i) employs analog memories to maximize the capacity of each memory cell, and (ii) jointly optimize model compression and physical storage to maximize memory utility.
arXiv Detail & Related papers (2021-02-15T18:19:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.