Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
- URL: http://arxiv.org/abs/2408.08454v2
- Date: Wed, 28 Aug 2024 08:31:28 GMT
- Title: Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
- Authors: Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza
- Abstract summary: The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, but the memory footprint of Self-Attention poses significant challenges for long-sequence tasks.
Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads.
We introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping.
- Score: 3.3457276841127315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads, reducing the overall number of parameters and the memory requirements in a flexible manner without adversely compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case study, which introduces variability in (static) group formation by subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers for image classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically, ViT-L achieves accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and the other variants. We further analyze the impact of the number of key-value heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.
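The key-driven allocation at the heart of KDGQA is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of the idea as described in the abstract, not the authors' released implementation (available on their GitHub): the function name `kdgqa_attention`, the tensor shapes, the choice to measure norms on the pooled key heads, and the naive rounding of norm ratios into per-group head counts are all our assumptions.

```python
import torch
import torch.nn.functional as F

def kdgqa_attention(q, k, v, num_kv_heads):
    """Sketch of Key-Distributed GQA (KDGQA).

    q, k, v: (batch, num_query_heads, seq_len, head_dim), with
    num_query_heads divisible by num_kv_heads. Standard GQA
    mean-pools contiguous key/value heads into num_kv_heads groups
    and splits the query heads evenly across them; KDGQA instead
    sizes each group in proportion to the key heads' norm ratios.
    """
    B, H_q, N, d = q.shape
    group = H_q // num_kv_heads

    # GQA step: mean-pool contiguous key/value heads into groups.
    k_p = k.reshape(B, num_kv_heads, group, N, d).mean(dim=2)
    v_p = v.reshape(B, num_kv_heads, group, N, d).mean(dim=2)

    # KDGQA step: each pooled key head receives a share of the query
    # heads proportional to its share of the total key norm.
    # (Naive rounding; the paper's exact allocation rule may differ.)
    norms = torch.linalg.norm(k_p, dim=(-2, -1)).mean(dim=0)  # (num_kv_heads,)
    alloc = (norms / norms.sum() * H_q).round().long().clamp(min=1)
    alloc[-1] = H_q - alloc[:-1].sum()  # force the counts to sum to H_q
    alloc = alloc.tolist()
    assert min(alloc) > 0, "naive rounding failed; use a smarter allocation"

    # Attend each slice of query heads against its pooled KV head.
    out, start = [], 0
    for h, n_h in enumerate(alloc):
        q_h = q[:, start:start + n_h]
        k_h = k_p[:, h:h + 1].expand(-1, n_h, -1, -1)
        v_h = v_p[:, h:h + 1].expand(-1, n_h, -1, -1)
        out.append(F.scaled_dot_product_attention(q_h, k_h, v_h))
        start += n_h
    return torch.cat(out, dim=1)  # (batch, num_query_heads, seq_len, head_dim)
```

In this reading, DGQA would differ by letting the allocation follow the norm ratios as they evolve through training rather than recomputing it independently at each forward pass, while PGQA keeps static groups and instead subtracts noise from the attention maps.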
Related papers
- Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization [55.09893295671917]
This paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA).
The GRMP-IQA comprises two key modules: Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization.
Experiments on five standard BIQA datasets demonstrate performance superior to state-of-the-art BIQA methods under the limited-data setting.
arXiv Detail & Related papers (2024-09-09T07:26:21Z) - QCQA: Quality and Capacity-aware grouped Query Attention [5.121164018825873]
Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs).
We propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally inexpensive fitness function.
arXiv Detail & Related papers (2024-06-08T07:49:55Z) - Advancing Vision Transformers with Group-Mix Attention [59.585623293856735]
Group-Mix Attention (GMA) is an advanced replacement for traditional self-attention.
GMA simultaneously captures token-to-token, token-to-group, and group-to-group correlations with various group sizes.
GroupMixFormer achieves state-of-the-art performance in image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-11-26T01:25:03Z) - VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization [15.554325659263316]
Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities.
Existing domain generalization datasets for VQA focus solely on textual shifts.
We propose VQA-GEN, the first multi-modal benchmark dataset for distribution shift, generated through a shift-induced pipeline.
arXiv Detail & Related papers (2023-11-01T19:43:56Z) - Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based Baseline [95.88825497452716]
Gait benchmarks empower the research community to train and evaluate high-performance gait recognition systems.
GREW is the first large-scale dataset for gait recognition in the wild.
SPOSGait is the first NAS-based gait recognition model.
arXiv Detail & Related papers (2022-05-05T14:57:39Z) - VTAMIQ: Transformers for Attention Modulated Image Quality Assessment [0.0]
We propose a novel full-reference IQA method, Vision Transformer for Attention Modulated Image Quality (VTAMIQ).
Our method achieves competitive or state-of-the-art performance on the existing IQA datasets.
With large-scale pre-training for both classification and IQA tasks, VTAMIQ generalizes well to unseen sets of images and distortions.
arXiv Detail & Related papers (2021-10-04T18:35:29Z) - EQG-RACE: Examination-Type Question Generation [21.17100754955864]
We propose an innovative Examination-type Question Generation approach (EQG-RACE) to generate exam-like questions based on a dataset extracted from RACE.
Two main strategies are employed in EQG-RACE for dealing with discrete answer information and reasoning among long contexts.
Experimental results show state-of-the-art performance for EQG-RACE, which is clearly superior to the baselines.
arXiv Detail & Related papers (2020-12-11T03:52:17Z) - Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z) - Social Adaptive Module for Weakly-supervised Group Activity Recognition [143.68241396839062]
This paper presents a new task named weakly-supervised group activity recognition (GAR).
It differs from conventional GAR tasks in that only video-level labels are available, and the important persons within each frame are not annotated, even in the training data.
This makes it easier to collect and annotate a large-scale NBA dataset, which in turn raises new challenges for GAR.
arXiv Detail & Related papers (2020-07-18T16:40:55Z) - Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose an Information-Maximizing Hierarchical Conditional Variational Autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.