A3 : an Analytical Low-Rank Approximation Framework for Attention
- URL: http://arxiv.org/abs/2505.12942v3
- Date: Wed, 25 Jun 2025 23:03:54 GMT
- Title: A3 : an Analytical Low-Rank Approximation Framework for Attention
- Authors: Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao,
- Abstract summary: We propose $\tt A^\tt 3$, a post-training low-rank approximation framework. We show that $\tt A^\tt 3$ maintains superior performance compared to SoTAs. We also demonstrate the versatility of $\tt A^\tt 3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
- Score: 14.649496050074735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose $\tt A^\tt 3$, a post-training low-rank approximation framework. $\tt A^\tt 3$ splits a Transformer layer into three functional components, namely $\tt QK$, $\tt OV$, and $\tt MLP$. For each component, $\tt A^\tt 3$ provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component's functional loss ($\it i.e.$, error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that $\tt A^\tt 3$ maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA's 7.87 by 3.18. We also demonstrate the versatility of $\tt A^\tt 3$, including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
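The core idea of reducing the hidden dimension *inside* the $\tt QK$ component can be illustrated with a minimal sketch. Since attention scores depend on the fused product $W_Q W_K^T$, a truncated SVD of that product yields two thinner projections that preserve the scores. Note this is only an assumption-laden illustration of the general approach, not the paper's analytical solution (which minimizes functional loss using input statistics; plain SVD ignores the data distribution). The function name `low_rank_qk` and all shapes are hypothetical.

```python
import numpy as np

def low_rank_qk(w_q, w_k, rank):
    """Jointly compress W_Q and W_K so that the attention-score
    product X W_Q W_K^T X^T is preserved up to `rank`.

    Truncated SVD of the fused matrix W_Q @ W_K.T gives the best
    rank-r approximation of the product in Frobenius norm; the two
    SVD factors become the new, thinner projections.
    """
    fused = w_q @ w_k.T                      # (d, d) product the scores depend on
    u, s, vt = np.linalg.svd(fused, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    w_q_new = u[:, :rank] * sqrt_s           # (d, r) compressed query projection
    w_k_new = vt[:rank, :].T * sqrt_s        # (d, r) compressed key projection
    return w_q_new, w_k_new

rng = np.random.default_rng(0)
d, r = 64, 16
w_q = rng.standard_normal((d, d))
w_k = rng.standard_normal((d, d))
w_q_new, w_k_new = low_rank_qk(w_q, w_k, r)

x = rng.standard_normal((8, d))              # a batch of 8 token embeddings
scores_full = x @ w_q @ w_k.T @ x.T
scores_lr = x @ w_q_new @ w_k_new.T @ x.T
rel_err = np.linalg.norm(scores_full - scores_lr) / np.linalg.norm(scores_full)
print(f"relative score error at rank {r}: {rel_err:.3f}")
```

Because the two new projections are narrower, this shrinks both FLOPs and (for keys) the cached dimension, with no extra GEMM launches at runtime.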
Related papers
- 3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs [20.28912929805946]
We introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{LR})$ decomposition of Large Language Models. Our experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration. Our method achieves over 2.5x faster compression runtime on an A100 GPU compared to SOTA $(\mathbf{S} + \mathbf{LR})$ methods.
arXiv Detail & Related papers (2026-03-02T02:16:46Z) - The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models [33.90597962418094]
We propose CLP, a novel continuous layer pruning framework for large language models. CLP uses a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning. CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
arXiv Detail & Related papers (2025-10-25T16:40:17Z) - ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization [99.96330641363396]
ARMOR (Adaptive Representation with Matrix-factORization) is a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block-diagonal matrices. We demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations.
arXiv Detail & Related papers (2025-10-07T02:39:20Z) - PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z) - SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention [10.607730369798551]
We introduce input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T. It improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
arXiv Detail & Related papers (2025-09-16T08:36:05Z) - FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [61.79405341803085]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z) - Second-order Optimization of Gaussian Splats with Importance Sampling [51.95046424364725]
3D Gaussian Splatting (3DGS) is widely used for novel view rendering due to its high quality and fast inference time. We propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG). Our method achieves a 3$\times$ speedup over standard LM and outperforms Adam by 6$\times$ when the Gaussian count is low.
arXiv Detail & Related papers (2025-04-17T12:52:08Z) - Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers [18.469378618426294]
We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
arXiv Detail & Related papers (2025-02-03T19:24:01Z) - Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models [1.6385815610837167]
We present Pivoting Factorization (PIFA), a novel low-rank representation that learns, without supervision, a compact form of any low-rank representation. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method. MPIFA significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning.
arXiv Detail & Related papers (2025-01-31T12:36:31Z) - TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs [58.19080159470868]
We propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix.
arXiv Detail & Related papers (2025-01-31T11:34:03Z) - LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator [11.167930856636161]
We introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator framework that utilizes vector quantization to convert neural network models into LUTs. We show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of $1.4\times$ to $7.0\times$ and $1.5\times$ to $146.1\times$, respectively.
arXiv Detail & Related papers (2025-01-18T05:27:25Z) - V"Mean"ba: Visual State Space Models only need 1 hidden dimension [0.7864304771129751]
State Space Models (SSMs) have emerged as a solution by introducing a linear recurrence mechanism. We introduce \textit{VMeanba}, a training-free compression method that eliminates the channel dimension in SSMs using mean operations. We show that \textit{VMeanba} achieves up to a 1.12$\times$ speedup with less than a 3% accuracy loss.
arXiv Detail & Related papers (2024-12-21T12:27:07Z) - S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity [39.679861450783605]
We propose a family of Structured Sparse Fine-Tuning (S$^2$FT) methods for LLMs. S$^2$FT accomplishes this by "selecting sparsely and computing densely". We show that S$^2$FT saves training memory by up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT.
arXiv Detail & Related papers (2024-12-09T08:24:11Z) - Demystifying Linear MDPs and Novel Dynamics Aggregation Framework [8.087699764574788]
In linear MDPs, the feature dimension $d$ is lower bounded by $S/U$ in order to aptly represent transition probabilities.
We propose a novel structural aggregation framework based on dynamics, named "dynamics aggregation".
Our proposed algorithm exhibits statistical efficiency, achieving a regret of $\tilde{O}(d_\psi^{3/2} H^{3/2} \sqrt{T})$, where $d_\psi$ represents the feature dimension of the aggregated subMDPs.
arXiv Detail & Related papers (2024-10-31T16:21:41Z) - From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models.
We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE.
arXiv Detail & Related papers (2024-07-15T21:05:20Z) - Scalable 3D Registration via Truncated Entry-wise Absolute Residuals [65.04922801371363]
A 3D registration approach can process more than ten million ($10^7$) point pairs with over 99% random outliers.
We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
arXiv Detail & Related papers (2024-04-01T04:43:39Z) - Learning Low-Rank Representations for Model Compression [6.721845345130468]
We propose a Low-Rank Representation Vector Quantization ($\text{LR}^2\text{VQ}$) method that outperforms previous VQ algorithms in various tasks and architectures.
In our method, the compression ratio can be directly controlled by $m$, and the final accuracy is solely determined by $\tilde{d}$.
With a proper $\tilde{d}$, we evaluate $\text{LR}^2\text{VQ}$ with ResNet-18/ResNet-50 on the ImageNet classification dataset, achieving 2.8%/1.0% top-1 accuracy improvements.
arXiv Detail & Related papers (2022-11-21T12:15:28Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Deep Learning Meets Projective Clustering [66.726500395069]
A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A \in \mathbb{R}^{n \times d}$.
Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace with a set of $k$ subspaces.
arXiv Detail & Related papers (2020-10-08T22:47:48Z)
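The last entry's idea of replacing a single low-rank subspace of the embedding matrix with $k$ subspaces can be sketched as a Lloyd-style alternation: assign each row to the subspace that fits it best, then refit each subspace by SVD of its assigned rows. This is a generic illustration under simple assumptions, not the paper's algorithm; `projective_clustering` and all parameter names are hypothetical.

```python
import numpy as np

def projective_clustering(a, k, j, iters=20, seed=0):
    """Approximate the rows of `a` with k rank-j subspaces by
    alternating row assignment and per-cluster SVD refitting."""
    rng = np.random.default_rng(seed)
    n, d = a.shape
    labels = rng.integers(0, k, size=n)          # random initial assignment
    bases = [None] * k
    for _ in range(iters):
        for c in range(k):                        # refit each subspace
            rows = a[labels == c]
            if len(rows) < j:                     # pad tiny clusters with random rows
                extra = a[rng.integers(0, n, size=j)]
                rows = np.vstack([rows, extra]) if len(rows) else extra
            _, _, vt = np.linalg.svd(rows, full_matrices=False)
            bases[c] = vt[:j]                     # (j, d) orthonormal basis
        # residual of each row against each subspace
        res = np.stack([
            np.linalg.norm(a - (a @ b.T) @ b, axis=1) for b in bases
        ])                                        # (k, n)
        labels = res.argmin(axis=0)
    # compressed form: per-row coefficients plus k shared bases
    coeffs = np.stack([a[i] @ bases[labels[i]].T for i in range(n)])
    return labels, bases, coeffs

rng = np.random.default_rng(1)
a = rng.standard_normal((200, 32))                # stand-in embedding matrix
labels, bases, coeffs = projective_clustering(a, k=4, j=8)
recon = np.stack([coeffs[i] @ bases[labels[i]] for i in range(len(a))])
err = np.linalg.norm(a - recon) / np.linalg.norm(a)
print(f"relative reconstruction error: {err:.3f}")
```

The compressed model stores only the small coefficient vectors and the $k$ shared bases, rather than the full $n \times d$ matrix.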
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.