Related papers: MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

URL: http://arxiv.org/abs/2406.09297v2
Date: Sun, 16 Jun 2024 03:57:51 GMT
Title: MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Authors: Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
Abstract summary: We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage. Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss.
Score: 13.272502334507617
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by a factor of up to 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
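
To make the 6x figure concrete, the following is a rough sketch of how KV cache size scales with the number of layers that hold their own KV heads. It is not taken from the paper's code. The shape used (12 layers, 12 attention heads, head dimension 64) is an assumption about Pythia-160M, and the label "MLKV-2" (one KV head shared across groups of layers, so two KV heads in total), the sequence length, and the 2-byte element size are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class KVConfig:
    name: str
    kv_layers: int  # layers that store their own keys/values (assumed)
    kv_heads: int   # KV heads in each such layer (assumed)


def kv_cache_bytes(cfg, head_dim=64, seq_len=2048, batch=1, bytes_per_elem=2):
    """Approximate KV cache size; the factor 2 covers keys plus values."""
    return (2 * cfg.kv_layers * cfg.kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)


# Hypothetical configurations for a 12-layer, 12-head model.
configs = [
    KVConfig("MHA", kv_layers=12, kv_heads=12),   # every head has its own KV
    KVConfig("GQA-4", kv_layers=12, kv_heads=4),  # 4 KV heads per layer
    KVConfig("MQA", kv_layers=12, kv_heads=1),    # 1 KV head per layer
    KVConfig("MLKV-2", kv_layers=2, kv_heads=1),  # KV shared across layers
]

sizes = {c.name: kv_cache_bytes(c) for c in configs}
for name, size in sizes.items():
    ratio = sizes["MQA"] / size
    print(f"{name:7s} {size / 2**20:6.2f} MiB  ({ratio:.2f}x smaller than MQA)")
```

Under these assumptions, MQA keeps 12 KV heads in total (one per layer), while sharing across layers leaves only 2, which is where a 6x reduction relative to MQA would come from.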

Related papers

KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly. TPA achieves improved model quality alongside memory efficiency. We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
Multi-matrix Factorization Attention [59.10039136733939]
We propose Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads. MFA-KR further reduces memory requirements by repurposing the key cache as value.
arXiv Detail & Related papers (2024-12-26T15:45:45Z)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption. We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention [19.796549720022554]
We show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers. We find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA.
arXiv Detail & Related papers (2024-05-21T17:59:29Z)
Layer-Condensed KV Cache for Efficient Inference of Large Language Models [44.24593677113768]
We propose a novel method that only computes and caches the KVs of a small number of layers. Our method achieves up to 26x higher throughput than standard transformers.
arXiv Detail & Related papers (2024-05-17T08:59:46Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference [1.9639467358416092]
Transformers have emerged as the backbone of large language models (LLMs). We propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
arXiv Detail & Related papers (2024-03-14T17:59:26Z)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI. KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6x less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.