RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices
- URL: http://arxiv.org/abs/2412.10856v3
- Date: Fri, 31 Jan 2025 06:34:12 GMT
- Title: RWKV-Lite: Deeply Compressed RWKV for Resource-Constrained Devices
- Authors: Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin
- Abstract summary: We propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy.
- Score: 15.969537866628517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To deploy LLMs on resource-constrained platforms such as mobile robots and smartphones, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV), has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts, which limits their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x -- 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require a 4x smaller memory footprint.
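As a rough illustration of what the reported 3.4x -- 5x reduction means in practice, the following back-of-the-envelope estimate computes weight-memory footprints before and after compression. The 1.6e9 parameter count and fp16 storage are illustrative assumptions, not figures taken from the paper.

    # Back-of-the-envelope footprint check for a 3.4x--5x weight compression.
    # The 1.6e9 parameter count and fp16 storage are illustrative assumptions.
    def footprint_gib(n_params: float, bytes_per_param: float = 2.0) -> float:
        """Approximate weight-memory footprint in GiB (fp16 = 2 bytes/param)."""
        return n_params * bytes_per_param / (1024 ** 3)

    baseline = footprint_gib(1.6e9)              # uncompressed fp16 weights
    for ratio in (3.4, 5.0):                     # compression range from the abstract
        print(f"{ratio:.1f}x: {baseline:.2f} GiB -> {baseline / ratio:.2f} GiB")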
Related papers
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
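A minimal numpy sketch of the dimensional-level idea described above: project each key/value vector into a smaller latent space before caching it. The random projection and the 1024-to-256 sizes are placeholders, not the paper's learned mappings.

    import numpy as np

    d_model, d_latent, n_tokens = 1024, 256, 128
    rng = np.random.default_rng(0)

    # stand-in for a learned down-projection of key/value vectors
    W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
    keys = rng.standard_normal((n_tokens, d_model))

    cached = keys @ W_down     # store (n_tokens, d_latent) instead of (n_tokens, d_model)
    print("cache size ratio:", cached.size / keys.size)   # 0.25 with these sizes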
arXiv Detail & Related papers (2025-07-15T12:52:12Z) - RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization [10.42496371916904]
RWKV is a modern RNN architecture with comparable performance to Transformers, but still faces challenges when deployed to resource-constrained devices. We propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques. Experiments show that RWKVQuant can quantize RWKV-6-14B into about 3-bit with less than 1% accuracy loss and 2.14x speed up.
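A rough sketch of what a proxy-guided hybrid quantizer could look like: a simple uniformity proxy decides, per weight tensor, between 3-bit uniform scalar quantization and a small codebook standing in for vector quantization. The proxy, threshold, and codebook construction are assumptions, not RWKVQuant's actual techniques.

    import numpy as np

    def quantize_3bit(w):
        """Uniform 3-bit (8-level) scalar quantization, returned dequantized."""
        levels = 2 ** 3
        lo, hi = w.min(), w.max()
        step = (hi - lo) / (levels - 1)
        return np.round((w - lo) / step) * step + lo

    def codebook_quantize(w, k=8):
        """Snap every weight to the nearest of k quantile centroids (VQ stand-in)."""
        centroids = np.quantile(w, np.linspace(0, 1, k))
        idx = np.abs(w[..., None] - centroids).argmin(axis=-1)
        return centroids[idx]

    def hybrid_quantize(w, kurtosis_thresh=3.5):
        """Proxy: heavy-tailed (high-kurtosis) tensors take the codebook path."""
        z = (w - w.mean()) / (w.std() + 1e-8)
        return codebook_quantize(w) if (z ** 4).mean() > kurtosis_thresh else quantize_3bit(w)

    w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
    print("mean abs quantization error:", float(np.abs(hybrid_quantize(w) - w).mean()))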
arXiv Detail & Related papers (2025-05-02T08:47:49Z) - Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning [54.584665518334035]
Hybrid architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance.
Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost.
We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities.
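A minimal sketch of group-aware structured pruning: output channels are scored and removed in whole groups so the block's internal structure stays aligned. The group size, Frobenius-norm scoring rule, and 50% keep ratio are illustrative assumptions.

    import numpy as np

    def prune_groups(W, group_size=8, keep_ratio=0.5):
        """Keep the highest-norm output-channel groups of a (out, in) weight matrix."""
        n_groups = W.shape[0] // group_size
        groups = W[: n_groups * group_size].reshape(n_groups, group_size, -1)
        scores = np.linalg.norm(groups, axis=(1, 2))        # one Frobenius score per group
        keep = np.sort(np.argsort(scores)[-int(n_groups * keep_ratio):])
        return groups[keep].reshape(-1, W.shape[1])         # pruned, still group-aligned

    W = np.random.default_rng(0).standard_normal((128, 64))
    print(W.shape, "->", prune_groups(W).shape)             # (128, 64) -> (64, 64)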
arXiv Detail & Related papers (2025-04-15T17:26:29Z) - FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method to prune large language models (LLMs) that selectively prunes model blocks based on an importance score.
We propose a principled metric to replace each pruned block using a weight-sharing mechanism.
Empirical evaluations demonstrate substantial performance gains over existing methods.
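A minimal sketch of importance-scored block pruning: each block is scored by how much it changes its input, and the least important half is removed (the weight-sharing replacement of pruned blocks is omitted here). The toy residual blocks and the cosine-based score are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def block_importance(block_fn, x):
        """1 - mean cosine similarity between a block's input and output."""
        y = block_fn(x)
        cos = (x * y).sum(-1) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8)
        return float(1.0 - cos.mean())

    # toy stand-ins for pre-trained blocks: residual layers of varying strength
    blocks = [(lambda x, W=rng.standard_normal((64, 64)) * s: x + x @ W)
              for s in (0.01, 0.20, 0.02, 0.30)]

    x = rng.standard_normal((16, 64))
    scores = [block_importance(b, x) for b in blocks]
    keep = sorted(np.argsort(scores)[len(scores) // 2:].tolist())   # keep the top half
    print("importance:", np.round(scores, 3), "| keeping blocks", keep)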
arXiv Detail & Related papers (2025-01-24T18:46:37Z) - A Survey of RWKV [16.618320854505786]
Receptance Weighted Key Value (RWKV) model offers a novel alternative to the Transformer architecture.
Unlike conventional Transformers, which depend heavily on self-attention, RWKV adeptly captures long-range dependencies with minimal computational demands.
This paper seeks to fill this gap as the first comprehensive review of the RWKV architecture, its core principles, and its varied applications.
arXiv Detail & Related papers (2024-12-19T13:39:24Z) - KVDirect: Distributed Disaggregated LLM Inference [6.609725967999848]
Large Language Models (LLMs) have become the new foundation for many applications, reshaping human society like a storm.
Disaggregated inference, which separates prefill and decode stages, is a promising approach to improving hardware utilization and service quality.
This paper introduces KVDirect, which optimizes KV cache transfer to enable distributed disaggregated LLM inference.
arXiv Detail & Related papers (2024-12-13T21:54:16Z) - EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
We re-formulate the model compression problem into the customized compensation problem.
We propose Training-free Eigenspace Low-Rank Approximation (EoRA).
EoRA directly minimizes compression-induced errors without requiring gradient-based training.
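A simplified sketch of training-free low-rank error compensation: compress a weight matrix, take a truncated SVD of the residual, and add the rank-r correction back at inference time. EoRA performs this in a projected eigenspace; the plain-SVD version and coarse-quantization compressor below are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256))

    W_c = np.round(W * 4) / 4                    # stand-in compressor: coarse quantization
    U, S, Vt = np.linalg.svd(W - W_c, full_matrices=False)

    r = 32
    A, B = U[:, :r] * S[:r], Vt[:r]              # rank-r factors of the compression residual
    W_hat = W_c + A @ B                          # compensated weights, no training involved

    print("error before:", float(np.linalg.norm(W - W_c)))
    print("error after: ", float(np.linalg.norm(W - W_hat)))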
arXiv Detail & Related papers (2024-10-28T17:59:03Z) - SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter.
We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs.
We show that SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables.
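A minimal sketch of sketching weights into a shared lookup table: each weight index hashes to one of T table entries, and only the table would be fine-tuned. The table size, hash function, and collision-averaging initialization are assumptions, not SpaLLM's actual sketching scheme.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal(1 << 16)             # flattened pre-trained weights
    T = 4096                                     # lookup-table size (compressed storage)

    idx = (np.arange(W.size) * 2654435761 % (1 << 32)) % T   # cheap multiplicative hash
    table = np.zeros(T)
    np.add.at(table, idx, W)                     # initialize entries by aggregating...
    table /= np.maximum(np.bincount(idx, minlength=T), 1)     # ...and averaging collisions

    W_shared = table[idx]                        # reconstructed (weight-shared) parameters
    print("storage ratio:", T / W.size)          # 0.0625 with these sizes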
arXiv Detail & Related papers (2024-10-08T20:58:24Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows with sequence length.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
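A minimal sketch of the low-rank KV-weight idea: factor a key projection into two thin matrices via truncated SVD so the cached representation has a smaller hidden dimension. Sizes and rank are assumptions; real projection weights are far closer to low-rank than the random matrix used here.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, rank = 1024, 128
    W_k = rng.standard_normal((d_model, d_model))           # a key projection matrix

    U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
    A = U[:, :rank] * S[:rank]                   # (d_model, rank): applied before caching
    B = Vt[:rank]                                # (rank, d_model): applied after loading the cache

    X = rng.standard_normal((32, d_model))       # a batch of token hidden states
    K_latent = X @ A                             # cache this (32, rank) instead of (32, d_model)
    K_approx = K_latent @ B                      # reconstruct keys on demand
    # a random W_k has no low-rank structure, so the error here is large;
    # trained projection weights are typically much closer to low-rank
    print("relative error:", float(np.linalg.norm(X @ W_k - K_approx) / np.linalg.norm(X @ W_k)))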
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs [1.5503410315996757]
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing (NLP) tasks.
However, the ever-growing complexity of LLMs demands immense computational resources.
This paper introduces Tensor Train Low-Rank Approximation (TT-LoRA), a novel parameter-efficient fine-tuning (PEFT) approach.
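A minimal sketch of a tensor-train (TT) parameterized weight update: instead of a dense delta matrix, store a chain of small 3-way cores and contract them on demand. The mode split, ranks, and initialization are illustrative assumptions, not TT-LoRA's configuration.

    import numpy as np

    rng = np.random.default_rng(0)
    modes, r = [8, 8, 8, 8], 4                   # out/in dims factored as 8*8 x 8*8 = 64 x 64

    # TT cores: shapes (r_{k-1}, mode_k, r_k) with boundary ranks of 1
    ranks = [1, r, r, r, 1]
    cores = [rng.standard_normal((ranks[k], modes[k], ranks[k + 1])) * 0.1
             for k in range(len(modes))]

    # contract the chain into a full 4-way tensor, then reshape to a matrix
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))   # join on the shared TT rank
    delta_W = full.reshape(modes[0] * modes[1], modes[2] * modes[3])

    tt_params = sum(c.size for c in cores)
    print("dense params:", delta_W.size, "| TT params:", tt_params)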
arXiv Detail & Related papers (2024-08-02T04:45:58Z) - Effectively Compress KV Heads for LLM [28.0801697946958]
We propose a novel approach for compressing Key-Value (KV) caches.
Our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs.
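A minimal sketch of KV-head compression: merge groups of key/value projection heads into shared heads (here by simple averaging), halving the number of KV heads and hence the cache. Head counts and the averaging merge are assumptions, not the paper's method.

    import numpy as np

    rng = np.random.default_rng(0)
    n_kv_heads, d_model, head_dim, group = 8, 512, 64, 2

    # per-head key projections; value projections would be handled the same way
    W_k = rng.standard_normal((n_kv_heads, d_model, head_dim))
    W_k_merged = W_k.reshape(n_kv_heads // group, group, d_model, head_dim).mean(axis=1)

    print("KV heads:", W_k.shape[0], "->", W_k_merged.shape[0],
          "| projection params:", W_k.size, "->", W_k_merged.size)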
arXiv Detail & Related papers (2024-06-11T08:37:33Z) - LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models [27.795088366122297]
We introduce LiteVAE, a new autoencoder design for latent diffusion models (LDMs).
LiteVAE uses the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.
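A minimal sketch of a single-level 2D Haar wavelet transform, the kind of fixed, cheap down-sampling LiteVAE applies before its encoder. The 64x64 single-channel input is an illustrative assumption.

    import numpy as np

    def haar_dwt2(x):
        """One level of the 2D Haar transform: returns (LL, LH, HL, HH), each half-size."""
        a, b = x[0::2, 0::2], x[0::2, 1::2]
        c, d = x[1::2, 0::2], x[1::2, 1::2]
        ll = (a + b + c + d) / 2
        lh = (a - b + c - d) / 2
        hl = (a + b - c - d) / 2
        hh = (a - b - c + d) / 2
        return ll, lh, hl, hh

    img = np.random.default_rng(0).random((64, 64))
    ll, lh, hl, hh = haar_dwt2(img)
    print(img.shape, "->", ll.shape, "x 4 sub-bands")        # (64, 64) -> (32, 32) x 4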
arXiv Detail & Related papers (2024-05-23T12:06:00Z) - Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization [40.15915011575071]
Low-rank compression is a promising technique to reduce non-essential parameters in large language models.
We conduct empirical research on the low-rank characteristics of large models.
We propose a low-rank compression method suitable for large language models.
arXiv Detail & Related papers (2024-05-17T08:27:12Z) - Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [96.00848293994463]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z) - Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z) - SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression [76.73007709690306]
We introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique.
SpQR achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs.
This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without performance degradation, with a 15% speedup.
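A minimal sketch of a sparse-plus-quantized weight format in the spirit described above: a small fraction of outlier weights stays in full precision (stored sparsely), while the rest is quantized to a 3-bit grid. The 1% outlier rate and the simple per-tensor grid are assumptions, not SpQR's actual format.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 512)).astype(np.float32)

    thresh = np.quantile(np.abs(W), 0.99)        # top-1% magnitudes become outliers
    outlier_mask = np.abs(W) > thresh

    dense = np.where(outlier_mask, 0.0, W)       # the part to be quantized
    lo, hi = dense.min(), dense.max()
    step = (hi - lo) / (2 ** 3 - 1)
    dense_q = np.round((dense - lo) / step) * step + lo

    W_hat = np.where(outlier_mask, W, dense_q)   # sparse fp outliers + quantized base
    print("outliers kept in fp:", float(outlier_mask.mean() * 100), "% of weights")
    print("mean abs error:", float(np.abs(W - W_hat).mean()))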
arXiv Detail & Related papers (2023-06-05T17:53:28Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.