Weighted Grouped Query Attention in Transformers
- URL: http://arxiv.org/abs/2407.10855v1
- Date: Mon, 15 Jul 2024 16:07:13 GMT
- Title: Weighted Grouped Query Attention in Transformers
- Authors: Sai Sena Chinnakonduru, Astarag Mohapatra
- Abstract summary: We propose a variation of Grouped-Query Attention, termed Weighted Grouped-Query Attention (WGQA).
We introduce new learnable parameters for each key and value head in the T5 decoder attention blocks, enabling the model to take a weighted average during finetuning.
Our model achieves an average improvement of 0.53% over GQA, and its performance converges to that of traditional Multi-head attention (MHA) with no additional overhead during inference.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism forms the foundational blocks for transformer language models. Recent approaches show that scaling the model achieves human-level performance. However, with increasing demands for scaling and constraints on hardware memory, the inference costs of these models remain high. To reduce the inference time, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) were proposed in (Shazeer, 2019) and (Ainslie et al., 2023) respectively. In this paper, we propose a variation of Grouped-Query Attention, termed Weighted Grouped-Query Attention (WGQA). We introduce new learnable parameters for each key and value head in the T5 decoder attention blocks, enabling the model to take a weighted average during finetuning. Our model achieves an average of 0.53% improvement over GQA, and its performance converges to that of traditional Multi-head attention (MHA) with no additional overhead during inference. We find that the introduction of these parameters and subsequent finetuning informs the model about the grouping mechanism during training, thereby enhancing performance. Additionally, we demonstrate the scaling laws in our analysis by comparing the results between the T5-small and T5-base architectures.
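The abstract describes the core mechanism: one learnable scalar per key head and per value head in the decoder attention blocks, so that the heads within a group are combined by a learned weighted average instead of the plain mean used by GQA. Below is a minimal PyTorch sketch of that idea, assuming a generic attention layer rather than the paper's actual T5 code; the names (WeightedGQA, n_kv_groups, k_weights, v_weights) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Weighted Grouped-Query Attention (WGQA).
# Hypothetical layer for illustration only; masking and dropout are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedGQA(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_groups: int):
        super().__init__()
        assert n_heads % n_kv_groups == 0
        self.n_heads = n_heads
        self.n_groups = n_kv_groups
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # One learnable scalar per key head and per value head.
        # Initializing to 1/group_size makes the layer start out as plain
        # GQA mean pooling; finetuning then learns a weighted average.
        group_size = n_heads // n_kv_groups
        self.k_weights = nn.Parameter(torch.full((n_heads, 1, 1), 1.0 / group_size))
        self.v_weights = nn.Parameter(torch.full((n_heads, 1, 1), 1.0 / group_size))

    def _group(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_heads, seq, d_head) -> weighted sum within each group
        b, h, s, d = x.shape
        x = (x * w).view(b, self.n_groups, h // self.n_groups, s, d)
        return x.sum(dim=2)  # (batch, n_groups, seq, d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        split = lambda t: t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        k = self._group(k, self.k_weights)  # (b, n_groups, s, d_head)
        v = self._group(v, self.v_weights)
        # Each query head attends to the grouped K/V of its own group.
        group_size = self.n_heads // self.n_groups
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
```

After finetuning, the learned per-head scalars can be folded into the key and value projection matrices, which is consistent with the paper's claim of no additional overhead during inference.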
Related papers
- FuXi-$α$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer [81.12174905444229]
Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy.
We propose a new model called FuXi-$\alpha$ to address these issues.
Our model outperforms existing models, with its performance continuously improving as the model size increases.
arXiv Detail & Related papers (2025-02-05T09:46:54Z) - Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly.
TPA achieves improved model quality alongside memory efficiency.
We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z) - Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA [8.305827430948654]
We propose a low-cost method for pruning MHA models into GQA models with any compression ratio of key-value heads.
Our strategy can compress up to 87.5% of key-value heads of the LLaMA2-7B model without too much performance degradation.
arXiv Detail & Related papers (2024-12-30T03:05:45Z) - STEAM: Squeeze and Transform Enhanced Attention Module [1.3370933421481221]
We propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers.
STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs.
STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three-fold reduction in GFLOPs.
arXiv Detail & Related papers (2024-12-12T07:38:10Z) - Advancing Vision Transformers with Group-Mix Attention [59.585623293856735]
Group-Mix Attention (GMA) is an advanced replacement for traditional self-attention.
GMA simultaneously captures token-to-token, token-to-group, and group-to-group correlations with various group sizes.
GroupMixFormer achieves state-of-the-art performance in image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-11-26T01:25:03Z) - Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis [93.0013343535411]
We propose a novel type of analysis called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim).
We show that adding STAC modules to ResNet style architectures can result in up to a 1.6% increase in top-1 accuracy.
Results from ClassRepSim analysis can be used to select an effective parameterization of the STAC module resulting in competitive performance.
arXiv Detail & Related papers (2023-06-16T18:29:26Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z) - Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)