Can pruning make Large Language Models more efficient?
- URL: http://arxiv.org/abs/2310.04573v1
- Date: Fri, 6 Oct 2023 20:28:32 GMT
- Title: Can pruning make Large Language Models more efficient?
- Authors: Sia Gholami, Marwan Omar
- Abstract summary: This paper investigates the application of weight pruning as an optimization strategy for Transformer architectures.
Our findings suggest that significant reductions in model size are attainable without considerable compromise on performance.
This work seeks to bridge the gap between model efficiency and performance, paving the way for more scalable and environmentally responsible deep learning applications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer models have revolutionized natural language processing with their
unparalleled ability to grasp complex contextual relationships. However, the
vast number of parameters in these models has raised concerns regarding
computational efficiency, environmental impact, and deployability on
resource-limited platforms. To address these challenges, this paper
investigates the application of weight pruning-a strategic reduction of model
parameters based on their significance-as an optimization strategy for
Transformer architectures. Through extensive experimentation, we explore
various pruning methodologies, highlighting their impact on model performance,
size, and computational demands. Our findings suggest that with judicious
selection of pruning hyperparameters, significant reductions in model size are
attainable without considerable compromise on performance. Moreover, when
coupled with post-pruning fine-tuning strategies, some pruned models even
exhibit enhanced generalization capabilities. This work seeks to bridge the gap
between model efficiency and performance, paving the way for more scalable and
environmentally responsible deep learning applications.
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models [0.0]
An increase in the number of connected devices around the world warrants compressed models that can be easily deployed at the local devices with low compute capacity and power accessibility.
We implemented both, quantization and pruning, compression techniques on popular deep learning models used in the image classification, object detection, language models and generative models-based problem statements.
arXiv Detail & Related papers (2024-07-22T14:20:53Z) - Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks [5.536630285985836]
We introduce parameter-efficient sparsity crafting (PESC)
PESC crafts dense models into sparse models using the mixture-of-experts (MoE) architecture.
Our best sparse model outperforms other sparse and dense models and exhibits superior general capabilities compared to GP3.5.
arXiv Detail & Related papers (2024-01-05T09:58:09Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared
Pre-trained Language Models [109.06052781040916]
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z) - A survey on efficient vision transformers: algorithms, techniques, and
performance benchmarking [19.65897437342896]
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications.
This paper mathematically defines the strategies used to make Vision Transformer efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios.
arXiv Detail & Related papers (2023-09-05T08:21:16Z) - Dependency Structure Search Bayesian Optimization for Decision Making Models [29.95525433889418]
We propose a compact multi-layered architecture modeling the dynamics of agent interactions through the concept of role.
We show strong empirical results under malformed or sparse reward.
arXiv Detail & Related papers (2023-08-01T15:56:24Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Aligning Optimization Trajectories with Diffusion Models for Constrained
Design Generation [17.164961143132473]
We introduce a learning framework that demonstrates the efficacy of aligning the sampling trajectory of diffusion models with the optimization trajectory derived from traditional physics-based methods.
Our method allows for generating feasible and high-performance designs in as few as two steps without the need for expensive preprocessing, external surrogate models, or additional labeled data.
Our results demonstrate that TA outperforms state-of-the-art deep generative models on in-distribution configurations and halves the inference computational cost.
arXiv Detail & Related papers (2023-05-29T09:16:07Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO)
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.