IteRABRe: Iterative Recovery-Aided Block Reduction
- URL: http://arxiv.org/abs/2503.06291v1
- Date: Sat, 08 Mar 2025 17:46:01 GMT
- Title: IteRABRe: Iterative Recovery-Aided Block Reduction
- Authors: Haryo Akbarianto Wibowo, Haiyue Song, Hideki Tanaka, Masao Utiyama, Alham Fikri Aji, Raj Dabre,
- Abstract summary: IteRABRe is a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. IteRABRe demonstrates particular strength in the preservation of linguistic capabilities, showing an improvement of 5% over the baselines in language-related tasks.
- Score: 36.37457533156018
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in the preservation of linguistic capabilities, showing an improvement of 5% over the baselines in language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating that multilingual capabilities are preserved.
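The abstract outlines an alternating loop of block pruning followed by a brief recovery phase on a small token budget. Below is a minimal sketch of that general scheme for a Hugging Face decoder-only model, assuming a LLaMA/Qwen-style `model.model.layers` stack; the cosine-similarity block scoring, the tiny recovery routine, and all function names are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch of an iterative prune-then-recover loop (assumption-laden,
# not the paper's exact method). Requires: torch, transformers.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def block_redundancy(model, batch):
    """Score each decoder block by cosine similarity between its input and
    output hidden states; higher similarity = block changes less = more prunable."""
    with torch.no_grad():
        hs = model(**batch, output_hidden_states=True).hidden_states
    return [
        F.cosine_similarity(hs[i].flatten(1), hs[i + 1].flatten(1), dim=-1).mean().item()
        for i in range(len(hs) - 1)
    ]


def recover(model, batch, steps=10, lr=1e-5):
    """Brief recovery phase: a few language-modeling steps on a small
    calibration batch (a stand-in for the paper's small recovery budget)."""
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding positions
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()


def iterative_prune(model_name, calib_texts, target_layers):
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.pad_token or tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.config.use_cache = False  # no KV cache needed while pruning/recovering
    batch = tok(calib_texts, return_tensors="pt", padding=True,
                truncation=True, max_length=512)
    layers = model.model.layers  # assumed LLaMA/Qwen-style decoder stack
    while len(layers) > target_layers:
        scores = block_redundancy(model, batch)
        del layers[max(range(len(layers)), key=scores.__getitem__)]  # drop the most redundant block
        model.config.num_hidden_layers = len(layers)
        recover(model, batch)  # short recovery before the next pruning round
    return model
```

In practice the recovery corpus would be far larger (the paper reports 2.5M tokens) and scoring would use more calibration data, but the alternating prune/recover structure is the point of the sketch.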
Related papers
- You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning [20.62274005080048]
PruneNet is a novel model compression method that reformulates model pruning as a policy learning process.
It can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance.
On complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model.
arXiv Detail & Related papers (2025-01-25T18:26:39Z)
- FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method to prune large language models (LLMs) that selectively prunes model blocks based on an importance score. We propose a principled metric to replace each pruned block using a weight-sharing mechanism (a rough sketch of the weight-sharing idea follows this entry). Empirical evaluations demonstrate substantial performance gains over existing methods.
arXiv Detail & Related papers (2025-01-24T18:46:37Z)
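The FlexiGPT entry above mentions replacing each pruned block via a weight-sharing mechanism. The snippet below is a rough sketch of how such a replacement could look, sharing a neighbouring block's parameters instead of deleting the block outright; the attribute names and the choice of neighbour are assumptions, not the paper's metric or procedure.

```python
# Hedged sketch: reuse a neighbouring decoder block's weights in place of a
# pruned one, so depth is preserved without storing new parameters.
# Assumes a LLaMA-style stack at `model.model.layers`; not FlexiGPT's actual method.
def share_block(model, idx):
    layers = model.model.layers
    assert idx > 0, "the first block has no predecessor to share weights with"
    layers[idx] = layers[idx - 1]  # same nn.Module object: positions idx-1 and idx now share parameters
    model.config.use_cache = False  # duplicated layer indices would confuse KV caching
    return model
```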
- LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization [17.190984773586745]
Current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices.
We propose efficient attention mechanism and low-bit quantization method to enhance the efficiency of VAR models while maintaining performance.
arXiv Detail & Related papers (2024-11-26T07:32:36Z)
- DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization [61.492590008258986]
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs.
We propose DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains (a minimal reweighting sketch follows this entry).
arXiv Detail & Related papers (2024-11-21T12:02:39Z)
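The DRPruning entry above names distributionally robust optimization as the mechanism for restoring balanced performance across domains. The snippet below sketches a group-DRO-style reweighting step, in which domains with higher loss are upweighted for the next update; the update rule, step size, and names are illustrative assumptions rather than DRPruning's actual formulation.

```python
# Hedged sketch of a group-DRO-style reweighting step (not DRPruning's exact method):
# domains with higher current loss receive more weight in the training objective.
import torch

def dro_reweight(domain_losses: torch.Tensor, weights: torch.Tensor, eta: float = 0.1):
    """domain_losses: per-domain training losses, shape [num_domains].
    weights: current domain weights summing to 1.
    Returns the reweighted loss to backpropagate and the updated weights."""
    with torch.no_grad():
        new_w = weights * torch.exp(eta * domain_losses)  # exponentiated-gradient ascent on weights
        new_w = new_w / new_w.sum()
    robust_loss = (new_w * domain_losses).sum()  # emphasizes the worst-performing domains
    return robust_loss, new_w
```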
- KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models [6.919270710497231]
We propose KVPruner to improve model efficiency while maintaining performance.
Compared to the original model, KVPruner reduces runtime memory usage by 50% and boosts throughput by over 35%.
arXiv Detail & Related papers (2024-09-17T10:35:30Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch (a generic distillation-loss sketch follows this entry).
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
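The Minitron entry above combines pruning with knowledge distillation. The sketch below shows a generic distillation-style recovery objective, in which the pruned student matches the original teacher's token distributions; the temperature, mixing weight, and masking details are illustrative choices, not the paper's recipe.

```python
# Hedged sketch of a distillation objective for recovering a pruned student
# from its unpruned teacher (generic KD, not Minitron's exact recipe).
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, batch, temperature=2.0, alpha=0.5):
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the LM loss
    with torch.no_grad():
        t_logits = teacher(**batch).logits  # frozen teacher provides soft targets
    out = student(**batch, labels=labels)
    vocab = out.logits.size(-1)
    kd = F.kl_div(
        F.log_softmax(out.logits / temperature, dim=-1).view(-1, vocab),
        F.softmax(t_logits / temperature, dim=-1).view(-1, vocab),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * kd + (1 - alpha) * out.loss  # mix soft-target KD with plain LM loss
```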
- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z)
- xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token [108.7069350303884]
xRAG is an innovative context compression method tailored for retrieval-augmented generation. xRAG seamlessly integrates document embeddings into the language model representation space. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks.
arXiv Detail & Related papers (2024-05-22T16:15:17Z)
- Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z)
- What Do Compressed Multilingual Machine Translation Models Forget? [102.50127671423752]
We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases.
We demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages.
arXiv Detail & Related papers (2022-05-22T13:54:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.