Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
- URL: http://arxiv.org/abs/2510.23912v2
- Date: Sat, 01 Nov 2025 01:55:07 GMT
- Title: Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
- Authors: Marko Karbevski, Antonij Mijoski
- Abstract summary: We prove under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
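The "over 8%" figure in the abstract is consistent with a simple parameter count, sketched below. This is an illustrative calculation, not the authors' code, and it rests on assumptions not stated on this page: each decoder block holds four d_model x d_model attention matrices (Query, Key, Value, output projection) and a two-matrix MLP with a 4x hidden expansion, while biases and layer-norm parameters are ignored. The function names are hypothetical, and the values d_model = 768 with 12 layers are the commonly cited GPT-3 small configuration.

```python
def block_weight_params(d_model: int, mlp_ratio: int = 4, with_query: bool = True) -> int:
    """Count weight-matrix parameters in one decoder block.

    Assumptions (not taken from the paper): biases and layer-norm
    parameters are ignored, and the MLP has two matrices of shape
    d_model x (mlp_ratio * d_model).
    """
    # Attention: W_Q, W_K, W_V, W_O, each d_model x d_model.
    attn_matrices = 4 if with_query else 3
    attn = attn_matrices * d_model * d_model
    # MLP: up-projection and down-projection.
    mlp = 2 * mlp_ratio * d_model * d_model
    return attn + mlp


def query_saving_fraction(d_model: int, n_layers: int, mlp_ratio: int = 4) -> float:
    """Fraction of non-embedding weights removed by dropping W_Q in every block."""
    full = n_layers * block_weight_params(d_model, mlp_ratio, with_query=True)
    reduced = n_layers * block_weight_params(d_model, mlp_ratio, with_query=False)
    return (full - reduced) / full


if __name__ == "__main__":
    # Assumed GPT-3 small shape: d_model = 768, 12 layers.
    frac = query_saving_fraction(d_model=768, n_layers=12)
    print(f"{frac:.4%}")  # prints 8.3333%
```

Under these assumptions each block has 12 * d_model^2 weights, so removing one d_model x d_model matrix saves 1/12 of them, independent of width and depth. A different MLP ratio or a grouped-query attention layout would change the fraction.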
Related papers
- HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning [5.407724832457912]
We propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant.
arXiv Detail & Related papers (2026-01-29T12:27:05Z)
- Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models [0.0]
Post-Training Quantization (PTQ) addresses this by reducing the precision of model weights, typically to 4-bit or lower. Current state-of-the-art methods rely on calibration data to identify salient weights. We propose a data-free, structure-aware hypothesis: that the weights identified as Principal Components via Singular Value Decomposition (SVD) are intrinsically important to the model's downstream performance.
arXiv Detail & Related papers (2025-12-01T06:58:30Z)
- TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning [83.93651411533533]
We introduce Tucker Adaptation (TuckA), a method with four key properties. We develop an efficient batch-level routing mechanism, which reduces the router's parameter size by a factor of $L$. Experiments on benchmarks in natural language understanding, image classification, and mathematical reasoning speak to the efficacy of TuckA.
arXiv Detail & Related papers (2025-11-10T09:03:16Z)
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA [65.38186593873313]
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. We introduce a proof-of-concept multi-call framework for MHQA, InfoQA. We construct a stringent and noise-rich benchmark to validate our theory and framework.
arXiv Detail & Related papers (2025-09-25T14:11:57Z)
- LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning [50.89500210372827]
Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. LoTA-QAF is a novel fine-tuning method specifically designed for quantized LLMs. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%.
arXiv Detail & Related papers (2025-05-24T14:47:28Z)
- Quantization Meets Reasoning: Exploring and Mitigating Degradation of Low-Bit LLMs in Mathematical Reasoning [39.56908863102256]
Low-bit post-training quantization impairs mathematical reasoning by up to 69.81% in harder settings. We address two deployment-critical questions with process-level precision. In our settings, as few as 332 curated examples and 3 to 5 minutes of compute on a single GPU recover 4-bit weight math reasoning toward the full-precision baseline.
arXiv Detail & Related papers (2025-05-16T12:11:40Z)
- Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full-precision weights into low-bit weights without costly retraining. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
arXiv Detail & Related papers (2025-04-10T02:19:03Z)
- Does Self-Attention Need Separate Weights in Transformers? [0.8528401618469594]
This work introduces a shared-weight self-attention-based BERT model that learns only one weight matrix for the Key, Value, and Query representations. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. On the GLUE dataset, the shared-weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models.
arXiv Detail & Related papers (2024-11-30T04:46:20Z)
- Injectivity capacity of ReLU gates [0.0]
We consider the injectivity property of ReLU network layers.
We develop a powerful program to handle the $\ell_0$ spherical perceptron and, implicitly, the injectivity of ReLU layers.
The obtained results are also shown to fairly closely match the replica predictions from [40].
arXiv Detail & Related papers (2024-10-28T00:57:10Z)
- A Mean Field Ansatz for Zero-Shot Weight Transfer [9.910243630243079]
We introduce a mean field ansatz to provide a theoretical explanation for weight transfer.
We empirically validate the RC ansatz by exploring simple examples and LLMs such as GPT-3 and Llama-3.1.
We show the mean-field point of view is adequate under suitable assumptions which can provide theoretical support for zero-shot weight transfer.
arXiv Detail & Related papers (2024-08-16T11:53:52Z)
- Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We prune 80% of the model's parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs [66.70431182736787]
It has been believed that weights in large language models (LLMs) contain significant redundancy. This paper presents a counter-argument: small-magnitude weights in pre-trained models encode vital knowledge essential for tackling difficult downstream tasks.
arXiv Detail & Related papers (2023-09-29T22:55:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.