Related papers: Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

URL: http://arxiv.org/abs/2405.10616v1
Date: Fri, 17 May 2024 08:27:12 GMT
Title: Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization
Authors: Yixin Ji, Yang Xiang, Juntao Li, Wei Chen, Zhongyi Liu, Kehai Chen, Min Zhang
Abstract summary: Low-rank compression is a promising technique to reduce non-essential parameters in large language models. We conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for large language models.
Score: 40.15915011575071
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, large language models (LLMs) have driven advances in natural language processing. Still, their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces non-essential parameters by decomposing weight matrices into products of two low-rank matrices. Yet, its application in LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and low-rank dimensions allocation. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for LLMs. This approach involves precise estimation of feature distributions through pooled covariance matrices and a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
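
The decomposition described in the abstract can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (svd_factorize, pooled_covariance, feature_factorize, factor_params) are hypothetical, the pooled covariance uses the textbook degrees-of-freedom weighting, and the feature-based variant simply projects layer outputs onto the top eigenvectors of that covariance. The Bayesian optimization step for allocating ranks across layers is not shown.

```python
import numpy as np


def svd_factorize(W, rank):
    """Weight-based factorization: W (d_out x d_in) ~= A @ B via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # (d_out, rank), singular values folded into A
    B = Vt[:rank, :]            # (rank, d_in)
    return A, B


def pooled_covariance(batches):
    """Pool per-batch covariances, weighting each by its degrees of freedom.

    Each batch is an (n_i, d) array with n_i >= 2 samples in rows.
    Textbook definition; assumed here, not taken from the paper.
    """
    acc, dof = None, 0
    for Y in batches:
        n = Y.shape[0]
        S = np.cov(Y, rowvar=False)  # unbiased (n - 1) estimate, shape (d, d)
        acc = (n - 1) * S if acc is None else acc + (n - 1) * S
        dof += n - 1
    return acc / dof


def feature_factorize(W, input_batches, rank):
    """Feature-based factorization: keep the top-`rank` principal directions
    of the layer outputs Y = X @ W.T, so that W ~= V @ (V.T @ W).

    Simplification: the feature mean is ignored when projecting.
    """
    outputs = [X @ W.T for X in input_batches]
    C = pooled_covariance(outputs)
    _, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :rank]       # top-`rank` eigenvectors, (d_out, rank)
    return V, V.T @ W                    # A: (d_out, rank), B: (rank, d_in)


def factor_params(shape, rank):
    """Parameter count of the two factors for a matrix of the given shape."""
    d_out, d_in = shape
    return rank * (d_out + d_in)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 6))
    batches = [rng.standard_normal((32, 6)) for _ in range(4)]
    A, B = feature_factorize(W, batches, rank=2)
    print(A.shape, B.shape, factor_params(W.shape, 2), W.size)
```

For an 8 x 6 matrix at rank 2, the two factors hold 2 * (8 + 6) = 28 parameters instead of 48. In general the factorization only saves parameters when the rank is below d_out * d_in / (d_out + d_in).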

Related papers

Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing [0.0]
Low-rank decompositions of Large Language Models (LLMs) are very demanding in terms of their computational resources. We present two physics-inspired improvements to SVD compression: FermiGrad, a gradient-descent algorithm that determines globally optimal layer-wise ranks, and PivGa, an additional lossless compression of the low-rank factors.
arXiv Detail & Related papers (2025-11-26T10:54:01Z)
1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models [15.798945727818753]
We introduce Synergistic Sparse and Low-Rank Compression (SSLC) methods for Large Language Models (LLMs). Low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Experiments on LLaMA and Qwen2.5 models (7B-70B) show that SSLC, without any additional training steps, consistently surpasses standalone methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2025-10-30T12:50:30Z)
CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression [0.0]
Large Language Models (LLMs) present significant deployment challenges due to their immense size and computational requirements. We introduce Corrective Adaptive Low-Rank Decomposition (CALR), a two-component compression approach. We show that CALR can reduce parameter counts by 26.93% to 51.77% while retaining 59.45% to 90.42% of the original model's performance.
arXiv Detail & Related papers (2025-08-21T13:16:02Z)
MGAA: Multi-Granular Adaptive Allocation for Low-Rank Compression of LLMs [9.244526043014098]
The Multi-Granular Adaptive Allocation (MGAA) method can adaptively allocate parameters between and within sublayers without task-specific evaluations in the compression process. Comprehensive evaluations of MGAA across multiple LLM backbone models and benchmark datasets demonstrate its superior performance.
arXiv Detail & Related papers (2025-07-04T04:54:01Z)
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. We show that SubZero enhances fine-tuning and achieves faster results compared to standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching [32.4599581528901]
"Two-tower" architecture is used for compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter. We propose SpaLLM (Sketched Adapting of LLMs), a novel compressive adaptation approach for LLMs. We show that SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables.
arXiv Detail & Related papers (2024-10-08T20:58:24Z)
LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method. We propose a higher-order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation. Our method can achieve a reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z)
Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs [1.5503410315996757]
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing (NLP) tasks. However, the ever-growing complexity of LLMs demands immense computational resources. This paper introduces Tensor Train Low-Rank Approximation (TT-LoRA), a novel parameter-efficient fine-tuning (PEFT) approach.
arXiv Detail & Related papers (2024-08-02T04:45:58Z)
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models. We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models [9.244526043014098]
Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memory and computational resources. In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of Transformer exhibits noticeable low-rank structure. We propose a mixed compression model, which organically combines Low-Rank matrix And structured Pruning (LoRAP).
arXiv Detail & Related papers (2024-04-15T11:53:22Z)
Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z)
Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rates. We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint. We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale. We also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z)