RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices
- URL: http://arxiv.org/abs/2412.10856v3
- Date: Fri, 31 Jan 2025 06:34:12 GMT
- Title: RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices
- Authors: Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin
- Abstract summary: We propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture.
Our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy.
- Score: 15.969537866628517
- Abstract: For deploying LLMs on resource-constrained platforms such as mobile robots and smartphones, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV), has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts, which limits their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require a 4x smaller memory footprint.
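The abstract does not spell out the individual techniques. As a minimal, hedged sketch of one generic post-training step in this family (low-rank factorization of a single projection matrix), assuming nothing about RWKV-Lite's actual pipeline:

```python
# Minimal sketch of low-rank post-training compression of a single
# projection matrix; illustrative only, not RWKV-Lite's exact pipeline.
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate W (d_out x d_in) as A @ B with A (d_out x r), B (r x d_in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # fold singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)  # toy weight matrix
A, B = low_rank_factorize(W, rank=256)

orig_params = W.size
compressed_params = A.size + B.size
print(f"params: {orig_params} -> {compressed_params} "
      f"({orig_params / compressed_params:.1f}x fewer)")
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

At rank 256 the two factors hold 4x fewer parameters than the dense matrix; the reported error is large for a random toy matrix and is only meant to show the mechanics.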
Related papers
- Low Resource Video Super-resolution using Memory and Residual Deformable Convolutions [3.018928786249079]
Transformer-based video super-resolution (VSR) models have set new benchmarks in recent years, but their substantial computational demands make most of them unsuitable for deployment on resource-constrained devices.
We propose a novel lightweight, parameter-efficient deep residual deformable convolution network for VSR.
With just 2.3 million parameters, our model achieves state-of-the-art SSIM of 0.9175 on the REDS4 dataset.
arXiv Detail & Related papers (2025-02-03T20:46:15Z) - FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method for pruning large language models (LLMs) that selectively removes model blocks based on an importance score.
We propose a principled metric to replace each pruned block using a weight-sharing mechanism.
Empirical evaluations demonstrate substantial performance gains over existing methods.
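The summary above does not state FlexiGPT's importance metric; the following is a hedged sketch of block pruning with a generic proxy score (how little a block changes its input), not the paper's actual metric or its weight-sharing replacement:

```python
# Hedged sketch: score decoder blocks by how much they change their input
# (a common importance proxy), then mark the lowest-scoring blocks for
# pruning. FlexiGPT's actual metric and weight-sharing replacement are not
# reproduced here.
import torch

@torch.no_grad()
def block_importance(blocks, hidden: torch.Tensor):
    """Return one score per block: 1 - cosine similarity of input vs. output."""
    scores = []
    for block in blocks:
        out = block(hidden)
        cos = torch.nn.functional.cosine_similarity(
            hidden.flatten(1), out.flatten(1), dim=-1).mean()
        scores.append(1.0 - cos.item())   # small change => low importance
        hidden = out
    return scores

# toy stand-ins for transformer blocks
blocks = [torch.nn.Linear(64, 64) for _ in range(8)]
hidden = torch.randn(4, 64)
scores = block_importance(blocks, hidden)
prune = sorted(range(len(blocks)), key=lambda i: scores[i])[:2]
print("blocks marked for pruning:", prune)
```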
arXiv Detail & Related papers (2025-01-24T18:46:37Z) - EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [79.56709262189953]
EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks.
EoRA offers a scalable, training-free solution to compensate for compression errors.
arXiv Detail & Related papers (2024-10-28T17:59:03Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory footprint include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
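As a hedged illustration of the plug-in low-rank approximation of a KV projection weight that the summary describes (not LoRC's progressive per-layer strategy), assuming a PyTorch nn.Linear projection:

```python
# Hedged sketch: replace a pretrained K/V projection nn.Linear with two
# low-rank linears obtained by truncated SVD, so the module can be swapped
# in without retraining. LoRC's progressive, layer-wise rank schedule is
# not reproduced here.
import torch
import torch.nn as nn

def to_low_rank(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data                                   # (d_out, d_in)
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)
    down = nn.Linear(W.shape[1], rank, bias=False)
    up = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    down.weight.data = Vt[:rank, :].contiguous()             # (rank, d_in)
    up.weight.data = (U[:, :rank] * S[:rank]).contiguous()   # (d_out, rank)
    if linear.bias is not None:
        up.bias.data = linear.bias.data.clone()
    return nn.Sequential(down, up)

k_proj = nn.Linear(4096, 4096, bias=False)    # stand-in for a key projection
k_proj_lr = to_low_rank(k_proj, rank=512)     # ~4x fewer parameters
x = torch.randn(1, 4096)
err = (k_proj(x) - k_proj_lr(x)).norm() / k_proj(x).norm()
# error is large for this random toy matrix; real K/V weights are far closer
# to low rank, which is the premise of this line of work
print(f"relative output error at rank 512: {err:.3f}")
```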
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models.
We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a single framework.
arXiv Detail & Related papers (2024-07-15T21:05:20Z) - Effectively Compress KV Heads for LLM [28.0801697946958]
We propose a novel approach for compressing Key-Value (KV) caches.
Our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs.
arXiv Detail & Related papers (2024-06-11T08:37:33Z) - LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models [27.795088366122297]
We introduce LiteVAE, a new autoencoder design for latent diffusion models (LDMs).
LiteVAE uses the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
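As a hedged sketch of the wavelet-transform front end the summary refers to, using the PyWavelets (pywt) package and a toy single-channel image rather than LiteVAE's actual encoder:

```python
# Hedged sketch: a 2D discrete wavelet transform splits an image into a
# half-resolution approximation plus three detail subbands, so a downstream
# encoder can operate on a smaller, already-decomposed signal. This shows the
# general mechanism only, not LiteVAE's architecture.
import numpy as np
import pywt

image = np.random.rand(256, 256).astype(np.float32)     # toy single-channel image
cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')             # each subband is 128x128
print("subband shapes:", cA.shape, cH.shape, cV.shape, cD.shape)

# stack subbands as channels for a lighter convolutional encoder
stacked = np.stack([cA, cH, cV, cD], axis=0)             # (4, 128, 128)
print("encoder input:", stacked.shape)

# the transform is invertible, so the decomposition itself loses no information
recon = pywt.idwt2((cA, (cH, cV, cD)), 'haar')
print("max reconstruction error:", np.abs(recon - image).max())
```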
arXiv Detail & Related papers (2024-05-23T12:06:00Z) - SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression [14.818355326032538]
We propose SVD-LLM, a new SVD-based compression method for Large Language Models (LLMs).
SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss.
Our results demonstrate the superiority of SVD-LLM over state-of-the-art methods, especially at high model compression ratios.
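As a hedged sketch of why whitening makes truncation map directly onto compression loss, assuming a toy layer y = x @ W and synthetic calibration activations (not SVD-LLM's full procedure):

```python
# Hedged sketch of truncation-aware whitening for SVD compression: whiten the
# weight by a Cholesky factor of the input covariance so that truncating
# singular values maps directly onto the output error ||X W - X W_hat||_F.
# Calibration data handling and per-layer details of SVD-LLM are omitted.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 512))        # calibration activations (n, d_in)
W = rng.standard_normal((512, 256))         # layer weight, y = x @ W
rank = 64

L = np.linalg.cholesky(X.T @ X + 1e-6 * np.eye(512))    # X^T X = L L^T
U, S, Vt = np.linalg.svd(L.T @ W, full_matrices=False)  # SVD of whitened weight
W_hat = np.linalg.solve(L.T, U[:, :rank] * S[:rank]) @ Vt[:rank, :]

# with whitening, the output error equals the energy of the dropped singular values
err = np.linalg.norm(X @ (W - W_hat))
print(err, np.sqrt((S[rank:] ** 2).sum()))   # the two numbers match
```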
arXiv Detail & Related papers (2024-03-12T07:31:18Z) - Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT in image classification accuracy while running significantly faster and using less memory.
arXiv Detail & Related papers (2024-03-04T18:46:20Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, while providing a 15% speedup.
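As a hedged sketch of the sparse-plus-quantized idea (low-bit group quantization with full-precision outliers kept sparse), not SpQR's exact format or its bilevel quantization of scales:

```python
# Hedged sketch: quantize each group of weights to a few bits, keep the
# largest-magnitude outliers in full precision as a sparse correction.
# SpQR's exact storage format is not reproduced here.
import numpy as np

def quantize_group(w: np.ndarray, bits: int = 3, outlier_frac: float = 0.01):
    # split off outliers (kept sparse, full precision)
    k = max(1, int(outlier_frac * w.size))
    idx = np.argsort(np.abs(w))[-k:]
    outliers = np.zeros_like(w)
    outliers[idx] = w[idx]
    base = w - outliers

    # uniform low-bit quantization of the remaining weights
    lo, hi = base.min(), base.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((base - lo) / scale)
    dequant = q * scale + lo
    return dequant + outliers, idx

rng = np.random.default_rng(0)
w = rng.standard_normal(256)
w[rng.integers(0, 256, 3)] *= 20          # inject a few outliers
approx, outlier_idx = quantize_group(w, bits=3)
print("kept outliers:", len(outlier_idx),
      "max abs error:", np.abs(w - approx).max())
```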
arXiv Detail & Related papers (2023-06-05T17:53:28Z)