Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
- URL: http://arxiv.org/abs/2510.05544v1
- Date: Tue, 07 Oct 2025 03:07:47 GMT
- Title: Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
- Authors: Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
- Abstract summary: Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they pose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge.
- Score: 11.762499172999886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they pose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change in network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, showing better accuracy at the same compression levels as well as inference speedups.
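For intuition, here is a minimal numpy sketch of the recipe the abstract describes: make the SVD activation-aware (here via whitening with a calibration Gram factor, which is one common construction and our assumption, not necessarily the authors' exact formulation), then let a single uniform tolerance pick a heterogeneous rank per layer. The alternating least-squares refinement is omitted.

```python
import numpy as np

def pgsvd_layer(W, X, tol):
    """Sketch: activation-aware low-rank factorization of one linear layer.

    W: (d_out, d_in) weight; X: (d_in, n) calibration activations.
    Whitening by a factor of the activation Gram matrix makes truncated-SVD
    error (approximately) equal to the activation-based error ||(W-Wr)X||_F,
    so one uniform tolerance yields a different rank for every layer.
    """
    G = X @ X.T + 1e-6 * np.eye(X.shape[0])      # regularized Gram matrix
    L = np.linalg.cholesky(G)                    # G = L @ L.T
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    # tail[r] = residual error if the top-r singular values are kept
    tail = np.sqrt(np.append(np.cumsum((s ** 2)[::-1])[::-1], 0.0))
    r = int(np.argmax(tail <= tol * np.linalg.norm(s)))  # smallest feasible rank
    A = U[:, :r] * s[:r]                         # (d_out, r)
    B = Vt[:r] @ np.linalg.inv(L)                # (r, d_in), undo the whitening
    return A, B, r

rng = np.random.default_rng(0)
W, X = rng.standard_normal((64, 256)), rng.standard_normal((256, 512))
A, B, r = pgsvd_layer(W, X, tol=0.1)
print(r, np.linalg.norm((W - A @ B) @ X) / np.linalg.norm(W @ X))
```

Running the same `tol` over every layer of a network is what produces the heterogeneous rank profile the paper's Pareto argument justifies.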
Related papers
- SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression [27.258302662888166]
SAES-SVD is a low-rank compression framework for large language models. It jointly optimizes intra-layer reconstruction and inter-layer error compensation. Experiments show that SAES-SVD consistently improves post-compression performance.
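The error-compensation idea can be pictured in a few lines: fit each compressed layer to reproduce the full model's output given the already-compressed upstream activations, so upstream error is absorbed rather than accumulated. This linear, sequential version is our illustrative stand-in for the paper's joint objective, not its actual algorithm.

```python
import numpy as np

def compress_with_compensation(weights, X0, rank):
    """Sketch: sequential low-rank compression with error compensation.

    Each layer is fit to map the *compressed* model's activations onto the
    *full* model's outputs, so accumulated upstream error is counteracted
    instead of propagated. Nonlinearities are ignored for brevity.
    """
    acts_full, acts_comp, compressed = X0, X0, []
    for W in weights:
        y_full = W @ acts_full                      # target from full model
        W_fit = y_full @ np.linalg.pinv(acts_comp)  # least-squares compensation
        U, s, Vt = np.linalg.svd(W_fit, full_matrices=False)
        Wc = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-r truncation
        compressed.append(Wc)
        acts_full, acts_comp = y_full, Wc @ acts_comp
    return compressed

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((32, 32)) for _ in range(3)]
_ = compress_with_compensation(Ws, rng.standard_normal((32, 64)), rank=8)
```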
arXiv Detail & Related papers (2026-02-03T03:23:10Z)
- MTC-VAE: Multi-Level Temporal Compression with Content Awareness [54.85288415164888]
Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. We present a technique to convert fixed-compression-rate VAEs into models that support multi-level temporal compression.
arXiv Detail & Related papers (2026-02-01T17:08:02Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work rethinks that paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that the method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- Semantic Retention and Extreme Compression in LLMs: Can We Have Both? [0.0]
Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques. We show how strategically combining pruning and quantization could yield superior performance-to-compression ratios. We introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation.
arXiv Detail & Related papers (2025-05-12T07:23:19Z)
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP), an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
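The "any compression" property can be pictured as reading per-layer ranks off one global importance ranking learned during that single run. A hedged sketch follows; the scoring and greedy budget-fitting here are our assumptions for illustration, not ACIP's actual parametrization.

```python
def ranks_for_budget(scores, cost_per_rank, budget):
    """Sketch: materialize per-layer ranks for any parameter budget.

    scores: per-layer lists of learned importance scores (one per rank-1
    component); cost_per_rank: parameters one extra component costs in each
    layer. A single global ranking serves every budget post hoc.
    """
    flat = sorted(
        ((sc, li) for li, layer in enumerate(scores) for sc in layer),
        reverse=True,                        # most important components first
    )
    ranks, used = [0] * len(scores), 0
    for sc, li in flat:
        if used + cost_per_rank[li] <= budget:
            ranks[li] += 1
            used += cost_per_rank[li]
    return ranks

# e.g. two layers, a tight budget keeps only the best-fitting components
print(ranks_for_budget([[0.9, 0.2], [0.8, 0.1]], [100, 60], budget=220))
```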
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
- CALLIC: Content Adaptive Learning for Lossless Image Compression [64.47244912937204]
CALLIC sets a new state of the art (SOTA) for learned lossless image compression. We propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on the testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing the learning process and reducing adaptation time.
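The progressive schedule in RPFT is easy to illustrate: sort patches by estimated entropy, hardest first, and enlarge the fine-tuning subset stage by stage. The stage fractions below are made-up placeholders, not CALLIC's values.

```python
import numpy as np

def rpft_patch_schedule(patches, est_entropy, stages=(0.25, 0.5, 1.0)):
    """Sketch: entropy-sorted, gradually growing fine-tuning subsets.

    Patches with the highest estimated entropy are used first; the subset
    then grows each stage until it covers the whole image.
    """
    order = np.argsort(est_entropy)[::-1]        # descending entropy
    for frac in stages:
        k = max(1, int(round(frac * len(patches))))
        yield [patches[i] for i in order[:k]]

patches = ["p0", "p1", "p2", "p3"]
for subset in rpft_patch_schedule(patches, est_entropy=[0.2, 0.9, 0.5, 0.7]):
    print(subset)   # ['p1'], then ['p1', 'p3'], then all four
```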
arXiv Detail & Related papers (2024-12-23T10:41:18Z)
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
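To make the HIGGS recipe concrete, here is a hedged sketch: an orthonormal Hadamard rotation makes weight entries approximately Gaussian so a single shared grid fits them. HIGGS then quantizes on an MSE-optimal (Gaussian) grid; a plain uniform grid stands in for it below.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_quantize(W, bits=4):
    """Sketch: data-free quantization via a normalized Hadamard rotation.

    Rotating by H/sqrt(n) (orthonormal) Gaussianizes the entries; we round
    onto a uniform grid as a stand-in for HIGGS's MSE-optimal grid, then
    rotate back so the dequantized weights approximate W.
    """
    n = W.shape[1]
    assert n & (n - 1) == 0, "Hadamard transforms need a power-of-two size"
    H = hadamard(n) / np.sqrt(n)                   # orthonormal rotation
    Wr = W @ H
    step = 3.0 * Wr.std() / (2 ** (bits - 1))      # clip at roughly 3 sigma
    levels = 2 ** (bits - 1)
    q = np.clip(np.round(Wr / step), -levels, levels - 1)
    return (q * step) @ H.T                        # dequantize, rotate back

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
print(np.linalg.norm(W - rotate_quantize(W)) / np.linalg.norm(W))
```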
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with sequence length and becomes a deployment bottleneck.
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
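A minimal sketch of the plug-in low-rank step (LoRC's progressive, layer-wise ratio selection is omitted): factor the key projection so the cache stores r-dimensional latents and expands them on read.

```python
import numpy as np

def low_rank_kv(W_k, r):
    """Sketch: truncated SVD of a key/value projection for cache savings.

    With W_k ~ B @ A, each token caches the r-dim latent h @ B instead of
    the d_head-dim key h @ W_k; A reconstructs keys at attention time, so
    cache memory shrinks by about r / d_head with no retraining.
    """
    U, s, Vt = np.linalg.svd(W_k, full_matrices=False)
    B = U[:, :r] * s[:r]          # (d_model, r): applied when writing cache
    A = Vt[:r]                    # (r, d_head): applied when reading cache
    return B, A

rng = np.random.default_rng(0)
W_k = rng.standard_normal((512, 128))   # random weights are near worst-case;
B, A = low_rank_kv(W_k, r=32)           # real projections decay much faster
h = rng.standard_normal((1, 512))       # one token's hidden state
latent = h @ B                          # cache 32 floats instead of 128
print(np.linalg.norm(h @ W_k - latent @ A) / np.linalg.norm(h @ W_k))
```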
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models [49.970828419830355]
We introduce a new post-training compression paradigm for Large Language Models (LLMs): a training-free approach called Activation-aware Singular Value Decomposition (ASVD).
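Unlike the whitening variant sketched earlier, ASVD's activation-awareness comes from a diagonal scaling. A hedged sketch of that idea follows; the exact channel statistic ASVD uses may differ.

```python
import numpy as np

def asvd_layer(W, X, r):
    """Sketch: activation-aware SVD via per-channel diagonal scaling.

    Scale each input channel by its typical activation magnitude so the SVD
    budget favors channels that matter at inference:
    W = (W S) S^-1, truncate SVD(W S), then fold S^-1 into the right factor.
    """
    scale = np.abs(X).mean(axis=1) + 1e-8        # (d_in,) channel magnitudes
    U, s, Vt = np.linalg.svd(W * scale, full_matrices=False)  # = W @ diag(scale)
    A = U[:, :r] * s[:r]                         # (d_out, r)
    B = Vt[:r] / scale                           # (r, d_in): folds in diag(scale)^-1
    return A, B
```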
arXiv Detail & Related papers (2023-12-10T08:41:24Z)
- Lightweight Attribute Localizing Models for Pedestrian Attribute Recognition [13.480231032159834]
We propose a novel approach for determining the optimal ranks of low-rank layers, ensuring that the gradient direction of the compressed model closely aligns with that of the original model. This means that the compressed model effectively preserves the update direction of the full model, enabling more efficient compression for Pedestrian Attribute Recognition tasks.
arXiv Detail & Related papers (2023-06-16T13:07:13Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
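A minimal sketch of a token-level contrastive objective in this spirit: an InfoNCE-style stand-in, with the temperature and pairing scheme assumed by us rather than taken from the paper.

```python
import numpy as np

def token_contrastive_loss(student, teacher, tau=0.1):
    """Sketch: InfoNCE-style token-level contrastive distillation loss.

    Each quantized-model (student) token embedding is pulled toward the
    full-precision (teacher) embedding of the same token and pushed away
    from the other tokens', which keeps word embeddings distinguishable.
    student, teacher: (n_tokens, d) row-normalized embedding matrices.
    """
    logits = student @ teacher.T / tau                   # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                      # positives on diagonal

rng = np.random.default_rng(0)
E = rng.standard_normal((8, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)            # teacher embeddings
print(token_contrastive_loss(E + 0.01 * rng.standard_normal(E.shape), E))
```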
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z)