LoRAShear: Efficient Large Language Model Structured Pruning and
Knowledge Recovery
- URL: http://arxiv.org/abs/2310.18356v2
- Date: Tue, 31 Oct 2023 04:21:33 GMT
- Title: LoRAShear: Efficient Large Language Model Structured Pruning and
Knowledge Recovery
- Authors: Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang
- Abstract summary: Large Language Models (LLMs) have transformed the landscape of artificial intelligence.
We introduce LoRAShear, a novel efficient approach to structurally prune LLMs and recover knowledge.
LoRAShear effectively reduces the footprint of LLMs by 20% with only 1.0% performance degradation.
- Score: 42.018731237153446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have transformed the landscape of artificial
intelligence, while their enormous size presents significant challenges in
terms of computational costs. We introduce LoRAShear, a novel efficient
approach to structurally prune LLMs and recover knowledge. Given general LLMs,
LoRAShear first creates dependency graphs over LoRA modules to discover the
minimally removable structures and analyze the knowledge distribution. It then
proceeds with progressive structured pruning on LoRA adaptors and enables
inherent knowledge transfer to better preserve the information in the redundant
structures. To recover the knowledge lost during pruning, LoRAShear
meticulously studies and proposes a dynamic fine-tuning scheme with dynamic
data adaptors to effectively narrow the performance gap to the full models.
Numerical results demonstrate that, using only one GPU for a couple of GPU
days, LoRAShear effectively reduces the footprint of LLMs by 20% with only
1.0% performance degradation and significantly outperforms the state of the
art. The source code will be available at
https://github.com/microsoft/lorashear.
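To make the recipe above concrete, here is a minimal, hypothetical sketch of LoRA-based progressive structured pruning in PyTorch: a frozen base layer carries a trainable low-rank (LoRA) update, an importance score ranks structurally grouped components (here, illustratively, output channels), and the least important groups are zeroed out in stages while only the LoRA parameters are fine-tuned in between. The class names, the channel-level grouping, the L2 importance score, and the linear schedule are assumptions for illustration, not LoRAShear's actual dependency-graph construction or fine-tuning scheme.
```python
# Hypothetical sketch only: LoRA-instrumented layer plus staged channel pruning.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

    def effective_weight(self):
        return self.base.weight + self.scale * self.lora_B @ self.lora_A


def channel_importance(layer: LoRALinear) -> torch.Tensor:
    # Illustrative score: L2 norm of each output channel of the merged weight.
    return layer.effective_weight().detach().norm(dim=1)


def progressive_prune(layer: LoRALinear, target_ratio: float, steps: int):
    """Zero the least important output channels in stages; return the keep mask."""
    n_out = layer.base.out_features
    keep_mask = torch.ones(n_out, dtype=torch.bool)
    for step in range(1, steps + 1):
        target_dropped = int(n_out * target_ratio * step / steps)
        n_new = target_dropped - int((~keep_mask).sum())
        if n_new <= 0:
            continue
        scores = channel_importance(layer)
        scores[~keep_mask] = float("inf")   # never re-select removed channels
        drop = torch.topk(scores, n_new, largest=False).indices
        keep_mask[drop] = False
        with torch.no_grad():               # zero the structure in base and LoRA
            layer.base.weight[drop] = 0.0
            if layer.base.bias is not None:
                layer.base.bias[drop] = 0.0
            layer.lora_B[drop] = 0.0
        # ... fine-tune only the LoRA parameters here to recover knowledge ...
    return keep_mask  # zeroed rows can afterwards be removed to shrink the model
```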
Related papers
- How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? [55.33467849079774]
Low-rank adaptation (LoRA) is a popular and efficient training technique for updating Large Language Models or adapting them to specific domains.
We investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge.
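For reference, the LoRA parameterization referred to here keeps the pretrained weight frozen and trains only a low-rank residual; with rank r and scaling factor alpha, the adapted layer computes
```latex
h = W_0 x + \frac{\alpha}{r} B A x, \qquad
W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}} \text{ frozen}, \quad
B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}} \text{ trainable}, \quad
r \ll \min(d_{\text{in}}, d_{\text{out}}).
```
so the question studied is how much new knowledge the small matrices A and B can absorb while W_0, and the knowledge stored in it, stays untouched.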
arXiv Detail & Related papers (2025-02-20T12:31:03Z)
- DReSS: Data-driven Regularized Structured Streamlining for Large Language Models [30.47317140878219]
Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs.
We propose a novel paradigm that first applies regularization, then prunes, and finally finetunes.
By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance.
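A hedged sketch of the regularize-then-prune-then-finetune paradigm: during the regularization phase, a group penalty is applied only to the components scheduled for removal (illustrated below as a set of output channels), driving their contribution toward zero so the remaining weights absorb the important information before anything is cut. The group choice, penalty form, and training-loop shape are illustrative assumptions, not DReSS's exact formulation.
```python
# Hypothetical regularize -> prune -> finetune step (illustrative only).
import torch


def prune_penalty(weight: torch.Tensor, prune_rows: torch.Tensor,
                  lam: float = 1e-2) -> torch.Tensor:
    """Group-L2 penalty on the output channels scheduled for pruning."""
    return lam * weight[prune_rows].norm(dim=1).sum()


def regularized_step(model, batch, loss_fn, optimizer, prune_plan):
    # prune_plan maps a linear module to the row indices to be pruned later.
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    for module, rows in prune_plan.items():
        loss = loss + prune_penalty(module.weight, rows)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After enough regularized steps the selected rows are close to zero and can be
# removed with little damage; a short finetuning phase then recovers the rest.
```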
arXiv Detail & Related papers (2025-01-29T14:28:11Z)
- Less is More: Towards Green Code Large Language Models via Unified Structural Pruning [27.428983811427827]
We propose Flab-Pruner, an innovative unified structural pruning method that combines vocabulary, layer, and Feed-Forward Network (FFN) pruning.
The results demonstrate that Flab-Pruner retains 97% of the original performance after pruning 22% of the parameters and achieves the same or even better performance after post-training.
arXiv Detail & Related papers (2024-12-20T14:13:09Z)
- LoRA Unlearns More and Retains More (Student Abstract) [0.0]
PruneLoRA reduces the need for large-scale parameter updates by applying low-rank updates to the model.
We leverage LoRA to selectively modify a subset of the pruned model's parameters, thereby reducing the computational cost and memory requirements while improving the model's ability to retain performance on the remaining classes.
arXiv Detail & Related papers (2024-11-16T16:47:57Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs).
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
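As one concrete example of a structured feedforward layer, the dense FFN projections can be replaced by low-rank factorized ones; this particular parameterization is an assumption chosen for illustration, since the paper investigates structured FFNs more broadly.
```python
# Illustrative structured FFN: low-rank factorized up/down projections.
import torch.nn as nn


class LowRankFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, rank: int):
        super().__init__()
        # Each dense projection of size d_model x d_hidden is replaced by two
        # thin ones, cutting parameters from d_model*d_hidden to roughly
        # rank*(d_model + d_hidden).
        self.up = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                nn.Linear(rank, d_hidden))
        self.act = nn.GELU()
        self.down = nn.Sequential(nn.Linear(d_hidden, rank, bias=False),
                                  nn.Linear(rank, d_model))

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```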
arXiv Detail & Related papers (2024-06-24T08:43:21Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose optimization-based structural pruning for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
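A minimal sketch of learning pruning masks in a probabilistic space with a policy gradient, assuming one Bernoulli keep/drop variable per prunable unit and a REINFORCE-style estimator so no back-propagation through the model is required; the unit granularity, sparsity penalty, and estimator details are assumptions, not necessarily the paper's.
```python
# Hypothetical REINFORCE-style update for probabilistic pruning masks.
import torch


def policy_gradient_step(logits: torch.Tensor, eval_loss_fn,
                         lr: float = 1e-2, sparsity_weight: float = 0.1):
    """logits: one trainable logit per prunable unit (e.g. a head or channel).

    eval_loss_fn(mask) -> scalar loss of the model with the 0/1 mask applied.
    """
    probs = torch.sigmoid(logits)
    mask = torch.bernoulli(probs).detach()          # sample keep/drop decisions
    reward = -(eval_loss_fn(mask) + sparsity_weight * mask.mean())
    # REINFORCE: d/d(logits) E[reward] ~= reward * d log p(mask) / d(logits).
    log_prob = (mask * torch.log(probs + 1e-8)
                + (1 - mask) * torch.log(1 - probs + 1e-8)).sum()
    grad = torch.autograd.grad(log_prob, logits)[0]
    with torch.no_grad():
        logits += lr * float(reward) * grad         # gradient ascent on reward
    return float(reward)
```
Repeated over batches with `logits = torch.zeros(n_units, requires_grad=True)`, the units whose keep probability `sigmoid(logits)` stays low become the candidates for structural removal.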
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- A Closer Look at the Limitations of Instruction Tuning [52.587607091917214]
We show that Instruction Tuning (IT) fails to enhance knowledge or skills in large language models (LLMs).
We also show that popular methods to improve IT do not lead to performance improvements over a simple LoRA fine-tuned model.
Our findings reveal that responses generated solely from pre-trained knowledge consistently outperform responses by models that learn any form of new knowledge from IT on open-source datasets.
arXiv Detail & Related papers (2024-02-03T04:45:25Z)
- Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning [31.036465632204663]
We introduce Chain of LoRA, an iterative optimization framework inspired by the Frank-Wolfe algorithm.
We demonstrate that COLA can consistently outperform LoRA without additional computational or memory costs.
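The residual-learning idea behind Chain of LoRA can be pictured as repeatedly training a fresh low-rank module, merging it into the otherwise frozen weights, and starting another one on top of the merged result. The sketch below reuses the hypothetical LoRALinear layer from the LoRAShear example above and only illustrates that loop, not COLA's actual Frank-Wolfe-derived procedure.
```python
# Hypothetical chain-of-LoRA loop: train a LoRA residual, merge it, repeat.
import torch


def merge_and_reset(layer):
    """Fold the current LoRA update into the frozen base weight, then reset it."""
    with torch.no_grad():
        layer.base.weight += layer.scale * layer.lora_B @ layer.lora_A
        layer.lora_A.normal_(std=0.01)   # start a fresh residual
        layer.lora_B.zero_()


def chain_of_lora(lora_layers, finetune_fn, n_links: int = 3):
    for _ in range(n_links):
        finetune_fn()                    # train only the LoRA parameters
        for layer in lora_layers:
            merge_and_reset(layer)
```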
arXiv Detail & Related papers (2024-01-08T14:26:49Z)
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin [85.16356890023582]
We propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them by using a router network.
It freezes the backbone model and forces a portion of LoRAs to focus on leveraging world knowledge to solve downstream tasks.
Experimental results show that, as the instruction data increases, LoRAMoE can significantly improve the ability to process downstream tasks.
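A minimal sketch of the MoE-style plugin described above, assuming a softmax router that mixes the outputs of several LoRA adapters on top of a frozen base projection; the adapter count, token-level routing, and the training constraint that steers part of the adapters toward world knowledge are simplified assumptions.
```python
# Hypothetical LoRA mixture with a router over a frozen base projection.
import torch
import torch.nn as nn


class LoRAMoELinear(nn.Module):
    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # backbone stays frozen
        self.router = nn.Linear(base.in_features, n_experts)
        self.A = nn.Parameter(torch.randn(n_experts, rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, base.out_features, rank))

    def forward(self, x):                                 # x: (..., in_features)
        gates = torch.softmax(self.router(x), dim=-1)     # (..., n_experts)
        # Per-expert low-rank update, mixed by the router gates.
        low = torch.einsum("...i,eri->...er", x, self.A)          # (..., e, r)
        delta = torch.einsum("...er,eor->...eo", low, self.B)     # (..., e, out)
        return self.base(x) + torch.einsum("...e,...eo->...o", gates, delta)
```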
arXiv Detail & Related papers (2023-12-15T17:45:06Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
- Efficient shallow learning as an alternative to deep learning [0.0]
We show that the error rates of the generalized shallow LeNet architecture, consisting of only five layers, decay as a power law with the number of filters in the first convolutional layer.
A power law with a similar exponent also characterizes the generalized VGG-16 architecture.
A conservation law along the convolutional layers, namely the square root of their size times their depth, is found to minimize error rates.
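The reported scaling can be written compactly as a power law in the number of first-layer filters K, with the prefactor a and exponent rho left as fitted constants (their values are not reproduced here):
```latex
\varepsilon(K) \approx a\, K^{-\rho}
```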
arXiv Detail & Related papers (2022-11-15T10:10:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.