Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
- URL: http://arxiv.org/abs/2408.08454v2
- Date: Wed, 28 Aug 2024 08:31:28 GMT
- Title: Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
- Authors: Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza
- Abstract summary: The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, but the memory footprint of Self-Attention poses significant challenges for long-sequence tasks.
Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads.
We introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping.
- Score: 3.3457276841127315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads, reducing the overall number of parameters and the memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case study, which introduces variability in (static) group formation by subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers for image classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically, ViT-L achieves accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and the other variants. We further analyze the impact of the number of key-value heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
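The key-driven allocation at the heart of KDGQA is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of the idea as described in the abstract, not the authors' released implementation (available on their GitHub): the function name `kdgqa_attention`, the tensor shapes, the choice to measure norms on the pooled key heads, and the naive rounding of norm ratios into per-group head counts are all our assumptions.

```python
import torch
import torch.nn.functional as F

def kdgqa_attention(q, k, v, num_kv_heads):
    """Sketch of Key-Distributed GQA (KDGQA).

    q, k, v: (batch, num_query_heads, seq_len, head_dim), with
    num_query_heads divisible by num_kv_heads. Standard GQA
    mean-pools contiguous key/value heads into num_kv_heads groups
    and splits the query heads evenly across them; KDGQA instead
    sizes each group in proportion to the key heads' norm ratios.
    """
    B, H_q, N, d = q.shape
    group = H_q // num_kv_heads

    # GQA step: mean-pool contiguous key/value heads into groups.
    k_p = k.reshape(B, num_kv_heads, group, N, d).mean(dim=2)
    v_p = v.reshape(B, num_kv_heads, group, N, d).mean(dim=2)

    # KDGQA step: each pooled key head receives a share of the query
    # heads proportional to its share of the total key norm.
    # (Naive rounding; the paper's exact allocation rule may differ.)
    norms = torch.linalg.norm(k_p, dim=(-2, -1)).mean(dim=0)  # (num_kv_heads,)
    alloc = (norms / norms.sum() * H_q).round().long().clamp(min=1)
    alloc[-1] = H_q - alloc[:-1].sum()  # force the counts to sum to H_q
    alloc = alloc.tolist()
    assert min(alloc) > 0, "naive rounding failed; use a smarter allocation"

    # Attend each slice of query heads against its pooled KV head.
    out, start = [], 0
    for h, n_h in enumerate(alloc):
        q_h = q[:, start:start + n_h]
        k_h = k_p[:, h:h + 1].expand(-1, n_h, -1, -1)
        v_h = v_p[:, h:h + 1].expand(-1, n_h, -1, -1)
        out.append(F.scaled_dot_product_attention(q_h, k_h, v_h))
        start += n_h
    return torch.cat(out, dim=1)  # (batch, num_query_heads, seq_len, head_dim)
```

In this reading, DGQA would differ by letting the allocation follow the norm ratios as they evolve through training rather than recomputing it independently at each forward pass, while PGQA keeps static groups and instead subtracts noise from the attention maps.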
Related papers
- Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization [55.09893295671917]
This paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA).
The GRMP-IQA comprises two key modules: Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization.
Experiments on five standard BIQA datasets demonstrate performance superior to state-of-the-art BIQA methods under the limited-data setting.
arXiv Detail & Related papers (2024-09-09T07:26:21Z) - QCQA: Quality and Capacity-aware grouped Query Attention [5.121164018825873]
Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs).
We propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally inexpensive fitness function.
arXiv Detail & Related papers (2024-06-08T07:49:55Z) - Advancing Vision Transformers with Group-Mix Attention [59.585623293856735]
Group-Mix Attention (GMA) is an advanced replacement for traditional self-attention.
GMA simultaneously captures token-to-token, token-to-group, and group-to-group correlations with various group sizes.
GroupMixFormer achieves state-of-the-art performance in image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-11-26T01:25:03Z) - VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization [15.554325659263316]
Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities.
Existing domain generalization datasets for VQA focus solely on textual shifts.
We propose VQA-GEN, the first multi-modal benchmark dataset for distribution shift, generated through a shift-induced pipeline.
arXiv Detail & Related papers (2023-11-01T19:43:56Z) - Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based Baseline [95.88825497452716]
Gait benchmarks empower the research community to train and evaluate high-performance gait recognition systems.
GREW is the first large-scale dataset for gait recognition in the wild.
SPOSGait is the first NAS-based gait recognition model.
arXiv Detail & Related papers (2022-05-05T14:57:39Z) - VTAMIQ: Transformers for Attention Modulated Image Quality Assessment [0.0]
We propose a novel full-reference IQA method, Vision Transformer for Attention Modulated Image Quality (VTAMIQ).
Our method achieves competitive or state-of-the-art performance on the existing IQA datasets.
With large-scale pre-training for both classification and IQA tasks, VTAMIQ generalizes well to unseen sets of images and distortions.
arXiv Detail & Related papers (2021-10-04T18:35:29Z) - EQG-RACE: Examination-Type Question Generation [21.17100754955864]
We propose an innovative Examination-type Question Generation approach (EQG-RACE) to generate exam-like questions based on a dataset extracted from RACE.
Two main strategies are employed in EQG-RACE for dealing with discrete answer information and reasoning among long contexts.
Experimental results show state-of-the-art performance for EQG-RACE, which is clearly superior to the baselines.
arXiv Detail & Related papers (2020-12-11T03:52:17Z) - Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z) - Social Adaptive Module for Weakly-supervised Group Activity Recognition [143.68241396839062]
This paper presents a new task named weakly-supervised group activity recognition (GAR).
It differs from conventional GAR tasks in that only video-level labels are available, and the important persons within each frame are not annotated, even in the training data.
This makes it easier to collect and annotate a large-scale NBA dataset, which in turn raises new challenges for GAR.
arXiv Detail & Related papers (2020-07-18T16:40:55Z) - Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose an Information-Maximizing Hierarchical Conditional Variational Autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.