Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
- URL: http://arxiv.org/abs/2510.05544v1
- Date: Tue, 07 Oct 2025 03:07:47 GMT
- Title: Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
- Authors: Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
- Abstract summary: Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they pose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge.
- Score: 11.762499172999886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they pose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change in network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, showing better accuracy at the same compression levels as well as inference speedups.
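For intuition, here is a minimal numpy sketch of the recipe the abstract describes: make the SVD activation-aware (here via whitening with a calibration Gram factor, which is one common construction and our assumption, not necessarily the authors' exact formulation), then let a single uniform tolerance pick a heterogeneous rank per layer. The alternating least-squares refinement is omitted.

```python
import numpy as np

def pgsvd_layer(W, X, tol):
    """Sketch: activation-aware low-rank factorization of one linear layer.

    W: (d_out, d_in) weight; X: (d_in, n) calibration activations.
    Whitening by a factor of the activation Gram matrix makes truncated-SVD
    error (approximately) equal to the activation-based error ||(W-Wr)X||_F,
    so one uniform tolerance yields a different rank for every layer.
    """
    G = X @ X.T + 1e-6 * np.eye(X.shape[0])      # regularized Gram matrix
    L = np.linalg.cholesky(G)                    # G = L @ L.T
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    # tail[r] = residual error if the top-r singular values are kept
    tail = np.sqrt(np.append(np.cumsum((s ** 2)[::-1])[::-1], 0.0))
    r = int(np.argmax(tail <= tol * np.linalg.norm(s)))  # smallest feasible rank
    A = U[:, :r] * s[:r]                         # (d_out, r)
    B = Vt[:r] @ np.linalg.inv(L)                # (r, d_in), undo the whitening
    return A, B, r

rng = np.random.default_rng(0)
W, X = rng.standard_normal((64, 256)), rng.standard_normal((256, 512))
A, B, r = pgsvd_layer(W, X, tol=0.1)
print(r, np.linalg.norm((W - A @ B) @ X) / np.linalg.norm(W @ X))
```

Running the same `tol` over every layer of a network is what produces the heterogeneous rank profile the paper's Pareto argument justifies.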
Related papers
- SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression [27.258302662888166]
SAES-SVD is a low-rank compression framework for large language models. It jointly optimizes intra-layer reconstruction and inter-layer error compensation. Experiments show that SAES-SVD consistently improves post-compression performance.
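The error-compensation idea can be pictured in a few lines: fit each compressed layer to reproduce the full model's output given the already-compressed upstream activations, so upstream error is absorbed rather than accumulated. This linear, sequential version is our illustrative stand-in for the paper's joint objective, not its actual algorithm.

```python
import numpy as np

def compress_with_compensation(weights, X0, rank):
    """Sketch: sequential low-rank compression with error compensation.

    Each layer is fit to map the *compressed* model's activations onto the
    *full* model's outputs, so accumulated upstream error is counteracted
    instead of propagated. Nonlinearities are ignored for brevity.
    """
    acts_full, acts_comp, compressed = X0, X0, []
    for W in weights:
        y_full = W @ acts_full                      # target from full model
        W_fit = y_full @ np.linalg.pinv(acts_comp)  # least-squares compensation
        U, s, Vt = np.linalg.svd(W_fit, full_matrices=False)
        Wc = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-r truncation
        compressed.append(Wc)
        acts_full, acts_comp = y_full, Wc @ acts_comp
    return compressed

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((32, 32)) for _ in range(3)]
_ = compress_with_compensation(Ws, rng.standard_normal((32, 64)), rank=8)
```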
arXiv Detail & Related papers (2026-02-03T03:23:10Z)
- MTC-VAE: Multi-Level Temporal Compression with Content Awareness [54.85288415164888]
Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. We present a technique to convert fixed-compression-rate VAEs into models that support multi-level temporal compression.
arXiv Detail & Related papers (2026-02-01T17:08:02Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work rethinks that paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that the method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- Semantic Retention and Extreme Compression in LLMs: Can We Have Both? [0.0]
Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques. We show how strategically combining pruning and quantization could yield superior performance-to-compression ratios. We introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation.
arXiv Detail & Related papers (2025-05-12T07:23:19Z)
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP), an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
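The "any compression" property can be pictured as reading per-layer ranks off one global importance ranking learned during that single run. A hedged sketch follows; the scoring and greedy budget-fitting here are our assumptions for illustration, not ACIP's actual parametrization.

```python
def ranks_for_budget(scores, cost_per_rank, budget):
    """Sketch: materialize per-layer ranks for any parameter budget.

    scores: per-layer lists of learned importance scores (one per rank-1
    component); cost_per_rank: parameters one extra component costs in each
    layer. A single global ranking serves every budget post hoc.
    """
    flat = sorted(
        ((sc, li) for li, layer in enumerate(scores) for sc in layer),
        reverse=True,                        # most important components first
    )
    ranks, used = [0] * len(scores), 0
    for sc, li in flat:
        if used + cost_per_rank[li] <= budget:
            ranks[li] += 1
            used += cost_per_rank[li]
    return ranks

# e.g. two layers, a tight budget keeps only the best-fitting components
print(ranks_for_budget([[0.9, 0.2], [0.8, 0.1]], [100, 60], budget=220))
```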
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
- CALLIC: Content Adaptive Learning for Lossless Image Compression [64.47244912937204]
CALLIC sets a new state of the art (SOTA) for learned lossless image compression. We propose a content-aware autoregressive self-attention mechanism by leveraging convolutional gating operations. During encoding, we decompose pre-trained layers, including depth-wise convolutions, using low-rank matrices and then adapt the incremental weights on the testing image by Rate-guided Progressive Fine-Tuning (RPFT). RPFT fine-tunes with gradually increasing patches that are sorted in descending order by estimated entropy, optimizing the learning process and reducing adaptation time.
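The progressive schedule in RPFT is easy to illustrate: sort patches by estimated entropy, hardest first, and enlarge the fine-tuning subset stage by stage. The stage fractions below are made-up placeholders, not CALLIC's values.

```python
import numpy as np

def rpft_patch_schedule(patches, est_entropy, stages=(0.25, 0.5, 1.0)):
    """Sketch: entropy-sorted, gradually growing fine-tuning subsets.

    Patches with the highest estimated entropy are used first; the subset
    then grows each stage until it covers the whole image.
    """
    order = np.argsort(est_entropy)[::-1]        # descending entropy
    for frac in stages:
        k = max(1, int(round(frac * len(patches))))
        yield [patches[i] for i in order[:k]]

patches = ["p0", "p1", "p2", "p3"]
for subset in rpft_patch_schedule(patches, est_entropy=[0.2, 0.9, 0.5, 0.7]):
    print(subset)   # ['p1'], then ['p1', 'p3'], then all four
```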
arXiv Detail & Related papers (2024-12-23T10:41:18Z)
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
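To make the HIGGS recipe concrete, here is a hedged sketch: an orthonormal Hadamard rotation makes weight entries approximately Gaussian so a single shared grid fits them. HIGGS then quantizes on an MSE-optimal (Gaussian) grid; a plain uniform grid stands in for it below.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_quantize(W, bits=4):
    """Sketch: data-free quantization via a normalized Hadamard rotation.

    Rotating by H/sqrt(n) (orthonormal) Gaussianizes the entries; we round
    onto a uniform grid as a stand-in for HIGGS's MSE-optimal grid, then
    rotate back so the dequantized weights approximate W.
    """
    n = W.shape[1]
    assert n & (n - 1) == 0, "Hadamard transforms need a power-of-two size"
    H = hadamard(n) / np.sqrt(n)                   # orthonormal rotation
    Wr = W @ H
    step = 3.0 * Wr.std() / (2 ** (bits - 1))      # clip at roughly 3 sigma
    levels = 2 ** (bits - 1)
    q = np.clip(np.round(Wr / step), -levels, levels - 1)
    return (q * step) @ H.T                        # dequantize, rotate back

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))
print(np.linalg.norm(W - rotate_quantize(W)) / np.linalg.norm(W))
```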
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with sequence length and becomes a deployment bottleneck.
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
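A minimal sketch of the plug-in low-rank step (LoRC's progressive, layer-wise ratio selection is omitted): factor the key projection so the cache stores r-dimensional latents and expands them on read.

```python
import numpy as np

def low_rank_kv(W_k, r):
    """Sketch: truncated SVD of a key/value projection for cache savings.

    With W_k ~ B @ A, each token caches the r-dim latent h @ B instead of
    the d_head-dim key h @ W_k; A reconstructs keys at attention time, so
    cache memory shrinks by about r / d_head with no retraining.
    """
    U, s, Vt = np.linalg.svd(W_k, full_matrices=False)
    B = U[:, :r] * s[:r]          # (d_model, r): applied when writing cache
    A = Vt[:r]                    # (r, d_head): applied when reading cache
    return B, A

rng = np.random.default_rng(0)
W_k = rng.standard_normal((512, 128))   # random weights are near worst-case;
B, A = low_rank_kv(W_k, r=32)           # real projections decay much faster
h = rng.standard_normal((1, 512))       # one token's hidden state
latent = h @ B                          # cache 32 floats instead of 128
print(np.linalg.norm(h @ W_k - latent @ A) / np.linalg.norm(h @ W_k))
```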
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models [49.970828419830355]
We introduce a new post-training compression paradigm for Large Language Models (LLMs): a training-free approach called Activation-aware Singular Value Decomposition (ASVD).
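Unlike the whitening variant sketched earlier, ASVD's activation-awareness comes from a diagonal scaling. A hedged sketch of that idea follows; the exact channel statistic ASVD uses may differ.

```python
import numpy as np

def asvd_layer(W, X, r):
    """Sketch: activation-aware SVD via per-channel diagonal scaling.

    Scale each input channel by its typical activation magnitude so the SVD
    budget favors channels that matter at inference:
    W = (W S) S^-1, truncate SVD(W S), then fold S^-1 into the right factor.
    """
    scale = np.abs(X).mean(axis=1) + 1e-8        # (d_in,) channel magnitudes
    U, s, Vt = np.linalg.svd(W * scale, full_matrices=False)  # = W @ diag(scale)
    A = U[:, :r] * s[:r]                         # (d_out, r)
    B = Vt[:r] / scale                           # (r, d_in): folds in diag(scale)^-1
    return A, B
```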
arXiv Detail & Related papers (2023-12-10T08:41:24Z)
- Lightweight Attribute Localizing Models for Pedestrian Attribute Recognition [13.480231032159834]
We propose a novel approach for determining the optimal ranks of low-rank layers, ensuring that the gradient direction of the compressed model closely aligns with that of the original model. This means that the compressed model effectively preserves the update direction of the full model, enabling more efficient compression for Pedestrian Attribute Recognition tasks.
arXiv Detail & Related papers (2023-06-16T13:07:13Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
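A minimal sketch of a token-level contrastive objective in this spirit: an InfoNCE-style stand-in, with the temperature and pairing scheme assumed by us rather than taken from the paper.

```python
import numpy as np

def token_contrastive_loss(student, teacher, tau=0.1):
    """Sketch: InfoNCE-style token-level contrastive distillation loss.

    Each quantized-model (student) token embedding is pulled toward the
    full-precision (teacher) embedding of the same token and pushed away
    from the other tokens', which keeps word embeddings distinguishable.
    student, teacher: (n_tokens, d) row-normalized embedding matrices.
    """
    logits = student @ teacher.T / tau                   # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                      # positives on diagonal

rng = np.random.default_rng(0)
E = rng.standard_normal((8, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)            # teacher embeddings
print(token_contrastive_loss(E + 0.01 * rng.standard_normal(E.shape), E))
```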
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z)