Weighted Grouped Query Attention in Transformers
- URL: http://arxiv.org/abs/2407.10855v1
- Date: Mon, 15 Jul 2024 16:07:13 GMT
- Title: Weighted Grouped Query Attention in Transformers
- Authors: Sai Sena Chinnakonduru, Astarag Mohapatra
- Abstract summary: We propose a variation of Grouped-Query Attention, termed Weighted Grouped-Query Attention (WGQA).
We introduce new learnable parameters for each key and value head in the T5 decoder attention blocks, enabling the model to take a weighted average during finetuning.
Our model achieves an average of 0.53% improvement over GQA, and converges to traditional Multi-head attention (MHA) with no additional overhead during inference.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism forms the foundational building block of transformer language models. Recent approaches show that scaling the model achieves human-level performance. However, with increasing demands for scaling and constraints on hardware memory, the inference costs of these models remain high. To reduce inference time, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) were proposed by Shazeer (2019) and Ainslie et al. (2023), respectively. In this paper, we propose a variation of Grouped-Query Attention, termed Weighted Grouped-Query Attention (WGQA). We introduce new learnable parameters for each key and value head in the T5 decoder attention blocks, enabling the model to take a weighted average during finetuning. Our model achieves an average of 0.53% improvement over GQA, and its performance converges to traditional Multi-head attention (MHA) with no additional overhead during inference. We show that introducing these parameters and finetuning them informs the model about the grouping mechanism during training, thereby enhancing performance. Additionally, we demonstrate the scaling laws in our analysis by comparing results between the T5-small and T5-base architectures.
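The mechanism described in the abstract, replacing GQA's fixed mean-pooling of key/value heads with learnable per-head scalars that are finetuned, can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption: the class name `WeightedKVAggregation`, the shapes, the initialisation, and the folding of the weights into the grouped projection are one plausible reading of the abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn


class WeightedKVAggregation(nn.Module):
    """Sketch of the weighted aggregation idea: the key (or value) heads of a
    pretrained multi-head checkpoint are collapsed into fewer grouped heads.
    GQA mean-pools the heads in each group; WGQA instead learns a scalar weight
    per head during finetuning. Names and shapes are assumptions made for
    illustration, not the authors' code."""

    def __init__(self, num_heads: int, num_kv_groups: int, d_model: int, head_dim: int):
        super().__init__()
        assert num_heads % num_kv_groups == 0
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.head_dim = head_dim
        group_size = num_heads // num_kv_groups
        # Stand-in for the pretrained per-head key (or value) projections.
        self.per_head_proj = nn.Parameter(torch.randn(num_heads, d_model, head_dim) * 0.02)
        # One learnable scalar per original head, initialised to 1/group_size so
        # that training starts exactly at the GQA mean-pooling baseline.
        self.head_weights = nn.Parameter(torch.full((num_heads,), 1.0 / group_size))

    def grouped_projection(self) -> torch.Tensor:
        """Fold the scalar weights into the projections, giving a tensor of
        shape (num_kv_groups, d_model, head_dim). After finetuning this can be
        precomputed once, which is one plausible reading of the paper's
        'no additional overhead during inference' claim."""
        group_size = self.num_heads // self.num_kv_groups
        w = self.head_weights.view(self.num_kv_groups, group_size, 1, 1)
        proj = self.per_head_proj.view(self.num_kv_groups, group_size, -1, self.head_dim)
        return (w * proj).sum(dim=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model)
        # returns:       (batch, num_kv_groups, seq_len, head_dim)
        return torch.einsum("bsd,gdh->bgsh", hidden_states, self.grouped_projection())
```

Initialising the scalar weights to 1/group_size makes the module start at the GQA mean-pooling baseline, and folding the learned weights into the grouped projection after finetuning is consistent with the abstract's claim of no additional inference overhead.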
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention [3.3457276841127315]
Transformer architecture has revolutionized deep learning through its Self-Attention mechanism.
Grouped Query Attention (GQA) reduces the memory and inference cost of multi-head attention by grouping queries and mean-pooling the corresponding key-value heads.
We introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping.
arXiv Detail & Related papers (2024-08-15T23:34:04Z) - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
arXiv Detail & Related papers (2024-03-04T16:21:54Z) - Advancing Vision Transformers with Group-Mix Attention [59.585623293856735]
Group-Mix Attention (GMA) is an advanced replacement for traditional self-attention.
GMA simultaneously captures token-to-token, token-to-group, and group-to-group correlations with various group sizes.
GroupMixFormer achieves state-of-the-art performance in image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-11-26T01:25:03Z) - Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis [93.0013343535411]
We propose a novel type of analysis called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim).
We show that adding STAC modules to ResNet style architectures can result in up to a 1.6% increase in top-1 accuracy.
Results from ClassRepSim analysis can be used to select an effective parameterization of the STAC module resulting in competitive performance.
arXiv Detail & Related papers (2023-06-16T18:29:26Z) - 4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders [29.799797974513552]
This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict.
The four decoders are jointly trained so that they can be easily switched depending on the application scenarios.
The experimental results showed that the proposed model consistently reduced the WER.
arXiv Detail & Related papers (2022-12-21T07:15:59Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z) - Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z) - Fast Transformers with Clustered Attention [14.448898156256478]
We propose clustered attention, which, instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids (see the sketch below).
This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters.
We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget.
arXiv Detail & Related papers (2020-07-09T14:17:50Z)
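The clustered-attention entry above describes its mechanism concretely enough for a small illustration: queries are grouped into a fixed number of clusters, attention is computed only for the cluster centroids, and each query reuses its centroid's output. The sketch below is an assumption-laden toy (plain k-means, a single head, and the invented function name `clustered_attention`), not the implementation from the paper.

```python
import torch


def clustered_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        num_clusters: int, kmeans_iters: int = 5) -> torch.Tensor:
    """Sketch of clustered attention: group the queries into a fixed number of
    clusters, compute attention only for the cluster centroids, and let every
    query reuse its centroid's output. q, k, v have shape (seq_len, d_model).
    The clustering here is a toy k-means used purely for illustration."""
    n, d = q.shape
    # Toy k-means over the queries.
    centroids = q[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(kmeans_iters):
        assignment = torch.cdist(q, centroids).argmin(dim=1)      # (n,)
        for c in range(num_clusters):
            members = assignment == c
            if members.any():
                centroids[c] = q[members].mean(dim=0)
    assignment = torch.cdist(q, centroids).argmin(dim=1)
    # Attention is computed only for the C centroids against all keys.
    scores = centroids @ k.transpose(0, 1) / d ** 0.5             # (C, n)
    probs = torch.softmax(scores, dim=-1)
    centroid_out = probs @ v                                      # (C, d)
    # Every query copies the output of the centroid it belongs to.
    return centroid_out[assignment]                               # (n, d)


# Example: 128 positions, 64-dim representations, 8 clusters.
out = clustered_attention(torch.randn(128, 64), torch.randn(128, 64),
                          torch.randn(128, 64), num_clusters=8)
```

With the number of clusters C held fixed, the attention step costs O(C·N) rather than O(N²) in the sequence length N, which is the linear-complexity claim in the summary above.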