QCQA: Quality and Capacity-aware grouped Query Attention
- URL: http://arxiv.org/abs/2406.10247v1
- Date: Sat, 8 Jun 2024 07:49:55 GMT
- Title: QCQA: Quality and Capacity-aware grouped Query Attention
- Authors: Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney
- Abstract summary: Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs).
We propose Quality and Capacity-Aware Grouped Query Attention (QCQA) which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function.
- Score: 5.121164018825873
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function. We demonstrate that QCQA achieves a significantly better tradeoff between KV-cache capacity and LLM accuracy compared to GQA. For the Llama2 7B model, QCQA achieves 20% higher accuracy than GQA with similar KV-cache size requirements in the absence of fine-tuning. After fine-tuning both QCQA and GQA, for a similar KV-cache size, QCQA provides 10.55% higher accuracy than GQA. Furthermore, QCQA requires 40% less KV-cache size than GQA to attain similar accuracy. The proposed quality and capacity-aware grouping of query heads can serve as a new paradigm for KV-cache optimization in autoregressive LLM inference.
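To make the capacity side of this tradeoff concrete, the sketch below estimates KV-cache size from the number of key/value heads and runs a toy evolutionary search over query-head groupings. It is a minimal illustration under assumed Llama2-7B-like shapes, not the authors' implementation: `toy_fitness` is a placeholder for QCQA's inexpensive quality-aware fitness function, and the mutation/selection loop is a generic stand-in for the paper's evolutionary algorithm.

```python
import random

def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """KV-cache size in GiB: one K and one V tensor per layer per key/value head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes / 2**30

# Assumed Llama2-7B-like shapes: 32 layers, 32 query heads, head_dim 128, 4K context.
print(f"MHA, 32 KV heads: {kv_cache_gib(32, 32, 128, 4096):.2f} GiB")  # 2.00 GiB
print(f"GQA,  8 KV heads: {kv_cache_gib(32,  8, 128, 4096):.2f} GiB")  # 0.50 GiB

def random_grouping(n_heads, n_groups):
    """Randomly partition query heads into equal-size groups."""
    heads = list(range(n_heads))
    random.shuffle(heads)
    return [heads[i::n_groups] for i in range(n_groups)]

def toy_fitness(grouping):
    """Placeholder fitness: prefer groups of heads with nearby indices.
    QCQA instead uses an inexpensive quality-aware fitness function."""
    return -sum(max(g) - min(g) for g in grouping)

def mutate(grouping):
    """Swap two query heads between two randomly chosen groups."""
    child = [list(g) for g in grouping]
    a, b = random.sample(range(len(child)), 2)
    i, j = random.randrange(len(child[a])), random.randrange(len(child[b]))
    child[a][i], child[b][j] = child[b][j], child[a][i]
    return child

def evolve(n_heads=32, n_groups=8, pop=20, gens=200):
    """Minimal (mu + lambda)-style search over query-head groupings."""
    population = [random_grouping(n_heads, n_groups) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=toy_fitness, reverse=True)
        parents = population[: pop // 2]
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(population, key=toy_fitness)

print("best grouping found:", evolve())
```

Because mutation only swaps heads between groups, the KV-cache budget (number of groups) stays fixed while the search explores which query heads share a key/value head; the shape of the procedure, not the placeholder objective, is the point.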
Related papers
- Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning [19.942402563256962]
Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs).
We propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression.
Our method retains just 1.5% of the KV cache while achieving 97% of the performance of the full KV cache on the contextual question answering benchmark.
arXiv Detail & Related papers (2024-10-25T02:22:00Z)
- Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization [55.09893295671917]
This paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA)
The GRMP-IQA comprises two key modules: Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization.
Experiments on five standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods in the limited-data setting.
arXiv Detail & Related papers (2024-09-09T07:26:21Z)
- Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention [3.3457276841127315]
The Transformer architecture has revolutionized deep learning through its self-attention mechanism.
Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads.
We introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping.
arXiv Detail & Related papers (2024-08-15T23:34:04Z)
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention [19.796549720022554]
We show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers.
We find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA.
arXiv Detail & Related papers (2024-05-21T17:59:29Z)
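As a rough way to picture the cross-layer sharing described in the entry above: if adjacent layers share one set of key/value heads, the number of cached K/V tensors halves. The sketch below is a toy layer-to-slot mapping with an assumed share factor of 2, not the paper's exact scheme.

```python
def build_kv_cache_slots(n_layers, share_factor=1):
    """Map each transformer layer to a KV-cache slot. With share_factor=2,
    each pair of adjacent layers shares one slot (cross-layer sharing),
    halving the number of cached K/V tensors. Illustrative only."""
    return {layer: layer // share_factor for layer in range(n_layers)}

slots = build_kv_cache_slots(32, share_factor=2)
print(len(set(slots.values())))  # 16 distinct KV entries instead of 32
```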
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
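CORM's actual eviction criterion is specified in the paper; the sketch below only illustrates the general shape of score-based KV eviction, keeping the cached positions that recent queries attended to most. The keep ratio, shapes, and scoring rule are assumptions for illustration.

```python
import numpy as np

def evict_kv(keys, values, recent_attn, keep_ratio=0.3):
    """Generic score-based KV eviction sketch (not CORM's actual policy):
    keep the cached positions that recent queries attended to most.

    keys, values: (seq_len, head_dim) cached tensors for one head
    recent_attn:  (n_recent_queries, seq_len) attention weights
    """
    scores = recent_attn.mean(axis=0)             # importance of each cached position
    n_keep = max(1, int(keep_ratio * keys.shape[0]))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # top-scoring positions, kept in order
    return keys[keep], values[keep]

# Toy usage with random data.
rng = np.random.default_rng(0)
k, v = rng.normal(size=(1024, 128)), rng.normal(size=(1024, 128))
attn = rng.random((8, 1024))
attn /= attn.sum(axis=1, keepdims=True)           # normalize rows to attention weights
k_small, v_small = evict_kv(k, v, attn)
print(k_small.shape)                              # (307, 128) -- ~70% of the cache evicted
```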
- QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z)
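QAQ's scheme is quality-adaptive; as a simpler point of reference, the sketch below shows generic uniform per-token quantization of a cached key tensor, which is where the memory saving in KV-cache quantization comes from. The 4-bit width, per-token granularity, and storage layout are assumptions, not QAQ's method.

```python
import numpy as np

def quantize_per_token(x, n_bits=4):
    """Uniform per-token (per-row) quantization sketch for a KV tensor.
    Returns integer codes plus per-row scale and offset for dequantization."""
    qmax = 2**n_bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

# Toy usage: 4-bit codes (once bit-packed) would cut KV storage ~4x vs float16.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4096, 128)).astype(np.float32)
codes, scale, lo = quantize_per_token(keys)
print(codes.shape, np.abs(dequantize(codes, scale, lo) - keys).mean())
```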
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [25.154477500940626]
We propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute.
We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
arXiv Detail & Related papers (2023-05-22T17:16:38Z)
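The mean-pooling construction mentioned in the GQA and Key-Driven GQA entries can be sketched directly: per-head key (or value) projections from a multi-head checkpoint are averaged within each query-head group to initialize the grouped projections before uptraining. The shapes and the contiguous, equal-size grouping below are illustrative assumptions.

```python
import numpy as np

def mean_pool_kv_heads(w_kv, n_groups):
    """Initialize grouped K (or V) projections by mean-pooling multi-head ones.

    w_kv: (n_heads, head_dim, d_model) per-head projection weights from an MHA checkpoint
    returns: (n_groups, head_dim, d_model) one shared projection per query-head group
    """
    n_heads = w_kv.shape[0]
    assert n_heads % n_groups == 0, "contiguous equal-size groups assumed"
    group_size = n_heads // n_groups
    return w_kv.reshape(n_groups, group_size, *w_kv.shape[1:]).mean(axis=1)

# Toy usage: 32 multi-head key projections pooled into 8 grouped key projections.
rng = np.random.default_rng(0)
w_k = rng.normal(size=(32, 128, 4096)).astype(np.float32)
print(mean_pool_kv_heads(w_k, n_groups=8).shape)  # (8, 128, 4096)
```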
- Matching Game for Optimized Association in Quantum Communication Networks [65.16483325184237]
This paper proposes a swap-stable request-QS association algorithm for quantum switches.
It achieves near-optimal performance (within 5% of optimal) in terms of the percentage of served requests.
It is shown to be scalable and to maintain its near-optimal performance even as the size of the QCN increases.
arXiv Detail & Related papers (2023-05-22T03:39:18Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging.
Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them [70.09741980324912]
Open-domain Question Answering models which directly leverage question-answer (QA) pairs show promise in terms of speed and memory.
We introduce a new QA-pair retriever, RePAQ, to complement PAQ.
We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models.
arXiv Detail & Related papers (2021-02-13T23:43:45Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.