GQA: Training Generalized Multi-Query Transformer Models from Multi-Head
Checkpoints
- URL: http://arxiv.org/abs/2305.13245v3
- Date: Sat, 23 Dec 2023 17:55:11 GMT
- Authors: Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy,
Federico Lebrón, Sumit Sanghai
- Abstract summary: We propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute.
We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
- Score: 25.154477500940626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-query attention (MQA), which only uses a single key-value head,
drastically speeds up decoder inference. However, MQA can lead to quality
degradation, and moreover it may not be desirable to train a separate model
just for faster inference. We (1) propose a recipe for uptraining existing
multi-head language model checkpoints into models with MQA using 5% of original
pre-training compute, and (2) introduce grouped-query attention (GQA), a
generalization of multi-query attention which uses an intermediate number of
key-value heads (more than one, fewer than the number of query heads). We show that
uptrained GQA achieves quality close to multi-head attention with comparable
speed to MQA.
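The two ideas in the abstract can be sketched in a few lines of numpy: the uptraining recipe converts a multi-head checkpoint's per-head key/value projections into grouped ones by mean-pooling each group, and at inference each group of query heads attends against its single shared key-value head. This is a minimal illustrative sketch, not the paper's implementation; the function names, tensor layouts, and shapes are assumptions chosen for clarity.

```python
import numpy as np

def mha_to_gqa_kv(kv_proj, num_heads, num_groups):
    """Mean-pool per-head K (or V) projection weights from a multi-head
    checkpoint into one projection per group (the uptraining conversion).

    kv_proj: (num_heads, d_model, d_head) per-head projection weights.
    Returns: (num_groups, d_model, d_head).
    """
    assert num_heads % num_groups == 0
    heads_per_group = num_heads // num_groups
    grouped = kv_proj.reshape(num_groups, heads_per_group, *kv_proj.shape[1:])
    return grouped.mean(axis=1)

def grouped_query_attention(q, k, v, num_groups):
    """q: (num_heads, seq, d_head); k, v: (num_groups, seq, d_head).
    Each consecutive block of num_heads // num_groups query heads shares
    one key-value head. num_groups = 1 recovers MQA; num_groups equal to
    num_heads recovers standard multi-head attention.
    """
    num_heads, seq, d_head = q.shape
    heads_per_group = num_heads // num_groups
    out = np.empty_like(q)
    for h in range(num_heads):
        g = h // heads_per_group  # the KV head shared by this query head
        scores = q[h] @ k[g].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = weights @ v[g]
    return out
```

The speed benefit comes from the KV cache: with `num_groups` key-value heads instead of `num_heads`, cache size (and the memory bandwidth spent loading it during decoding) shrinks by a factor of `num_heads / num_groups`.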
Related papers
- Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention [3.3457276841127315]
The Transformer architecture has revolutionized deep learning through its self-attention mechanism.
Grouped Query Attention (GQA) reduces the memory cost of attention by grouping query heads and mean-pooling the corresponding key-value heads.
We introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping.
arXiv Detail & Related papers (2024-08-15T23:34:04Z)
- Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions [103.20281438405111]
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models.
We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information.
We show that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism.
arXiv Detail & Related papers (2024-07-21T00:10:23Z)
- QCQA: Quality and Capacity-aware grouped Query Attention [5.121164018825873]
Excessive memory requirements of key and value features (the KV cache) present significant challenges in the autoregressive inference of large language models (LLMs).
We propose Quality and Capacity-aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally inexpensive fitness function.
arXiv Detail & Related papers (2024-06-08T07:49:55Z)
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention [19.796549720022554]
We show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers.
We find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA.
arXiv Detail & Related papers (2024-05-21T17:59:29Z)
- Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models [16.49601740473416]
We explore recipes to improve training efficiency by initializing one model from the other.
Using an encoder to warm-start seq2seq training, we show that we can match task performance of a from-scratch seq2seq model.
arXiv Detail & Related papers (2023-06-14T21:41:52Z)
- Modularized Zero-shot VQA with Pre-trained Models [20.674979268279728]
We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable.
Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-05-27T05:00:14Z)
- RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings.
Our results show that RoMQA is a challenging benchmark for large language models and provides a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
- QA4QG: Using Question Answering to Constrain Multi-Hop Question Generation [54.136509061542775]
Multi-hop question generation (MQG) aims to generate complex questions which require reasoning over multiple pieces of information of the input passage.
We propose QA4QG, a novel QA-augmented BART-based framework for MQG.
Our results on the HotpotQA dataset show that QA4QG outperforms all state-of-the-art models.
arXiv Detail & Related papers (2022-02-14T08:16:47Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose an information-maximizing hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.