IMPACT: Importance-Aware Activation Space Reconstruction
- URL: http://arxiv.org/abs/2507.03828v1
- Date: Fri, 04 Jul 2025 22:26:33 GMT
- Title: IMPACT: Importance-Aware Activation Space Reconstruction
- Authors: Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li,
- Abstract summary: Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.
- Score: 5.487612141214714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure, prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and uniform reconstruction can harm performance. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. IMPACT formulates an optimization problem that considers both activation structure and gradient sensitivity, and derives a closed-form solution where the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This enables low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.
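The closed-form result described in the abstract can be illustrated with a small numerical sketch. The snippet below is not the paper's implementation: the exact form of the weighting, the function name importance_weighted_basis, and all variable names are assumptions for illustration. It forms one plausible importance-weighted covariance of sampled activations, keeps its top-k eigenvectors, and uses them as a low-rank reconstruction basis.

```python
import numpy as np

def importance_weighted_basis(X, w, k):
    """X: (n, d) sampled activations; w: (d,) nonnegative importance weights
    (e.g., derived from gradient sensitivity); k: target rank."""
    Xw = X * np.sqrt(w)                            # scale each activation dimension by sqrt(importance)
    C = Xw.T @ Xw / X.shape[0]                     # importance-weighted activation covariance (d, d)
    eigvals, eigvecs = np.linalg.eigh(C)           # eigendecomposition of the symmetric covariance
    top = np.argsort(eigvals)[::-1][:k]            # indices of the k largest eigenvalues
    return eigvecs[:, top]                         # (d, k) reconstruction basis

# Toy usage: build the basis and form a rank-k reconstruction of the activations.
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 256))               # stand-in activations
w = rng.uniform(0.1, 1.0, size=256)                # stand-in importance weights
U = importance_weighted_basis(X, w, k=32)
X_hat = (X @ U) @ U.T                              # rank-32 approximation of X
```

In a full compression pipeline the resulting basis would typically be folded into the adjacent weight matrices; the sketch only illustrates the eigenvector computation the abstract describes.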
Related papers
- Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition [4.119890956388359]
We introduce Outlier-Driven Low-Rank Initialization (ODLRI) which assigns low-rank components the specific role of capturing activation-sensitive weights. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that ODLRI consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
arXiv Detail & Related papers (2025-06-02T09:15:13Z)
- FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression [15.784158079414235]
FLAT-LLM is a training-free structural compression method based on fine-grained low-rank transformations in the activation space. It achieves efficient and effective weight compression without recovery fine-tuning, completing calibration within a few minutes.
arXiv Detail & Related papers (2025-05-29T19:42:35Z)
- Weight Spectra Induced Efficient Model Adaptation [54.8615621415845]
Fine-tuning large-scale foundation models incurs prohibitive computational costs. We show that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact. We propose a novel method that leverages learnable rescaling of top singular directions.
arXiv Detail & Related papers (2025-05-29T05:03:29Z)
- LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- CURing Large Models: Compression via CUR Decomposition [1.1510009152620668]
We introduce CURing, a novel model compression method based on CUR matrix decomposition. By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss. For example, it reduces Llama3.1-8B's parameters to 7.32B (-9%) in just 129 seconds, over 20 times faster than prior compression methods.
arXiv Detail & Related papers (2025-01-08T01:11:17Z)
- DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models [33.4538652558253]
Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. We propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights. We also introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization.
arXiv Detail & Related papers (2024-12-30T12:00:47Z)
- From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories, and Applications [85.17672240603011]
We study the non-uniform low-rank properties of weight matrices in Large Language Models. We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning into one.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations [21.229296254354878]
We introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design.
The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules.
Results verify the optimality of our approach at high compression with respect to both efficiency and performance.
arXiv Detail & Related papers (2024-07-08T07:45:38Z)
- TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z)
- Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We prune 80% of the parameters while retaining 93.43% of the original performance, without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Language model compression with weighted low-rank factorization [73.61874728240568]
We introduce Fisher information to weigh the importance of parameters affecting the model prediction.
We find that our resulting task accuracy is much closer to the original model's performance.
Our method can directly compress a task-specific model while achieving better performance than other compact model strategies.
arXiv Detail & Related papers (2022-06-30T21:57:07Z)
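Since the entry above describes weighting a low-rank factorization by the Fisher-information importance of parameters, a minimal sketch of that general idea may be useful. It is not that paper's implementation: the row-wise aggregation of Fisher estimates, the function name fisher_weighted_lowrank, and all variable names are assumptions. Under a row-wise weighting, the weighted factorization has a closed form via an SVD of the row-scaled weight matrix.

```python
import numpy as np

def fisher_weighted_lowrank(W, fisher_row, k):
    """W: (out, in) weight matrix; fisher_row: (out,) positive per-row
    importance estimates; k: target rank. Returns A, B with W ≈ A @ B,
    minimizing the row-weighted reconstruction error."""
    s = np.sqrt(fisher_row)[:, None]                 # row-wise scaling factors
    U, S, Vt = np.linalg.svd(s * W, full_matrices=False)
    A = (U[:, :k] * S[:k]) / s                       # undo the scaling on the left factor
    B = Vt[:k]                                       # right factor
    return A, B

# Toy usage on a random weight matrix with synthetic per-row importance.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))
fisher = np.abs(rng.standard_normal(512)) + 1e-3
A, B = fisher_weighted_lowrank(W, fisher, k=64)      # importance-aware rank-64 factors
```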
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.