FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models
- URL: http://arxiv.org/abs/2508.01506v1
- Date: Sat, 02 Aug 2025 22:06:46 GMT
- Title: FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models
- Authors: Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li
- Abstract summary: FlashSVD is an end-to-end rank-aware streaming inference framework for SVD-compressed large language models. It cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%. It incurs no accuracy loss with upstream encoder compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
- Score: 15.244129138320782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language model (LLM) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal accuracy loss. Previous SVD-based approaches have focused primarily on reducing the memory footprint of model weights, largely overlooking the additional activation memory overhead incurred during inference when applying truncated factors via standard dense CUDA kernels. Our experiments demonstrate that this activation overhead, which scales with sequence length and hidden dimension, prevents current SVD compression techniques from achieving any reduction in peak inference memory, thereby limiting their viability for real-world, on-device deployments. We introduce FlashSVD, a novel, end-to-end rank-aware streaming inference framework specifically designed for SVD-compressed large language models. FlashSVD can be seamlessly integrated with any model that employs SVD-based methods for parameter reduction. By fusing low-rank projection kernels directly into both the self-attention and feed-forward network (FFN) pipelines, FlashSVD avoids materializing full-size activation buffers. Instead, small tiles of the truncated factors are loaded into on-chip SRAM, multiplied and reduced on the fly, and immediately evicted, preserving high GPU occupancy and adding no extra latency. On standard encoder benchmarks (e.g., BERT-Base), FlashSVD cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%, all while incurring no accuracy loss with upstream compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
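Below is a minimal PyTorch sketch of the streaming idea the abstract describes: a weight matrix is factored by truncated SVD, and the low-rank factors are applied one sequence tile at a time so that only a tile-sized rank-space intermediate exists at any moment, rather than a full-sequence activation buffer. This is only a conceptual, host-side approximation of the paper's fused GPU kernels; all function names, shapes, and the tile size are illustrative assumptions.

```python
import torch

def truncated_svd_factors(W: torch.Tensor, rank: int):
    """Factor a weight matrix W (d_out x d_in) into A (d_out x rank) and
    B (rank x d_in) with W ~= A @ B, i.e. a standard truncated SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vh[:rank, :]
    return A, B

def streaming_low_rank_linear(x, A, B, tile=128):
    """Apply y = x @ (A @ B).T one sequence tile at a time.

    x: (seq_len, d_in) hidden states; returns (seq_len, d_out).
    Only a (tile, rank) intermediate exists at any moment, rather than a
    full (seq_len, rank) or (seq_len, d_out) scratch buffer for the whole
    sequence; a fused GPU kernel would keep that tile in on-chip SRAM.
    """
    seq_len = x.shape[0]
    out = torch.empty(seq_len, A.shape[0], dtype=x.dtype, device=x.device)
    for start in range(0, seq_len, tile):
        stop = min(start + tile, seq_len)
        r = x[start:stop] @ B.T      # (tile, rank) rank-space projection
        out[start:stop] = r @ A.T    # expand directly into the output slice
    return out

# Usage: compress a hypothetical 768x3072 FFN projection to rank 256.
W = torch.randn(768, 3072)
A, B = truncated_svd_factors(W, rank=256)
x = torch.randn(4096, 3072)          # a long sequence of hidden states
y = streaming_low_rank_linear(x, A, B)
print(y.shape)                       # torch.Size([4096, 768])
```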
Related papers
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - ResSVD: Residual Compensated SVD for Large Language Model Compression [12.539815070352116]
Large language models (LLMs) have demonstrated impressive capabilities in a wide range of downstream natural language processing tasks. We propose ResSVD, a new post-training SVD-based LLM compression method. We leverage the residual matrix generated during the truncation process to reduce truncation loss.
arXiv Detail & Related papers (2025-05-26T15:14:54Z) - A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z) - AdaSVD: Adaptive Singular Value Decomposition for Large Language Models [84.60646883395454]
Singular Value Decomposition (SVD) has emerged as a promising compression technique for large language models (LLMs). Existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation. We propose AdaSVD, an adaptive SVD-based LLM compression approach.
arXiv Detail & Related papers (2025-02-03T14:34:37Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint becomes a significant bottleneck.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - LoMA: Lossless Compressed Memory Attention [0.0]
Lossless Compressed Memory Attention (LoMA) is a novel approach to reduce memory and computational demands during autoregressive generation.
LoMA incorporates a specialized training or fine-tuning procedure alongside an autoregressive generation algorithm optimized for the compressed context.
Experimental validation has demonstrated that LoMA significantly reduces computational consumption and memory usage.
arXiv Detail & Related papers (2024-01-16T09:18:46Z)
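Several of the entries above (KV-Latent, ReCalKV, LoRC) share the idea of shrinking the KV cache by storing keys and values in a lower-dimensional latent space. The following is a generic, hedged sketch of that shared idea only; the projection matrices, shapes, and names are assumptions and do not reproduce any specific paper's method, which would typically derive the projections from SVD of the KV weights or from offline calibration rather than random matrices.

```python
import torch

d_head, d_latent = 64, 16            # per-head dim vs. compressed cache dim

# Down/up projections; random matrices are used here purely for illustration.
P_down = torch.randn(d_head, d_latent) / d_head ** 0.5
P_up = torch.linalg.pinv(P_down)     # (d_latent, d_head) approximate inverse

def cache_kv(k, v):
    """Store keys/values in latent space: (seq, d_head) -> (seq, d_latent)."""
    return k @ P_down, v @ P_down

def attend(q, k_lat, v_lat):
    """Expand cached latents back to d_head and run standard attention."""
    k = k_lat @ P_up                 # (seq, d_head)
    v = v_lat @ P_up
    scores = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    return scores @ v

q = torch.randn(1, d_head)
k, v = torch.randn(128, d_head), torch.randn(128, d_head)
k_lat, v_lat = cache_kv(k, v)        # cached tensors are 4x smaller per token
out = attend(q, k_lat, v_lat)
print(out.shape)                     # torch.Size([1, 64])
```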
This list is automatically generated from the titles and abstracts of the papers on this site.