Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery
Tickets from Large Models
- URL: http://arxiv.org/abs/2306.10460v1
- Date: Sun, 18 Jun 2023 03:09:52 GMT
- Title: Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery
Tickets from Large Models
- Authors: Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Ying Ding, Zhangyang Wang
- Abstract summary: Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune large pre-trained models, generating subnetworks.
LTH is enormously inhibited by the repetitive full training and pruning routine of iterative magnitude pruning (IMP).
We propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks at a fraction of the IMP cost.
- Score: 106.19385911520652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pre-trained transformers have received explosive attention in the past few years due to their wide adaptability to numerous downstream applications via fine-tuning, but their exponentially increasing parameter counts are becoming a primary hurdle to even fine-tuning them without industry-standard hardware. Recently, the Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune these large pre-trained models, generating subnetworks that can achieve performance similar to their dense counterparts; however, the practicality of LTH is severely limited by the repetitive full training and pruning routine of iterative magnitude pruning (IMP), which worsens with increasing model size. Motivated by recent observations on model soups, which suggest that the fine-tuned weights of multiple models can be merged into a better minimum, we propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks at a fraction of the original IMP cost by replacing the expensive intermediate pruning stages of IMP with a computationally efficient weak-mask generation and aggregation routine. More specifically, during the mask generation stage, ISP takes a small handful of iterations using varying training protocols and data subsets to generate many weak and noisy subnetworks, and superposes them to average out the noise, creating a high-quality denoised subnetwork. Our extensive experiments and ablations on two popular large-scale pre-trained models, CLIP (previously unexplored in pruning) and BERT, across multiple benchmark vision and language datasets validate the effectiveness of ISP compared to several state-of-the-art pruning methods.
Codes are available at: \url{https://github.com/VITA-Group/instant_soup}
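The weak-mask generation and aggregation routine described in the abstract can be made concrete with a short PyTorch-style example. The snippet below is a minimal sketch, assuming a standard supervised fine-tuning setup; every helper name (few_step_finetune, weak_mask, instant_soup_mask) and the global magnitude-thresholding details are illustrative assumptions rather than the authors' implementation (the released code at the repository above is the authoritative reference).

    # Sketch of ISP-style pruning: generate several cheap, noisy masks from
    # short fine-tuning runs on different data subsets, then aggregate them.
    import copy
    import torch

    def few_step_finetune(model, loader, loss_fn, num_iters=100, lr=1e-4):
        """Cheap probe: a small handful of SGD steps on a data subset."""
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        batches = iter(loader)
        for _ in range(num_iters):
            try:
                x, y = next(batches)
            except StopIteration:
                batches = iter(loader)
                x, y = next(batches)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    def weak_mask(model, loader, loss_fn, sparsity):
        """One weak, noisy subnetwork: briefly fine-tune a copy of the model,
        then keep the largest-magnitude weights (global magnitude mask).
        For simplicity this masks every parameter tensor; in practice one
        would typically restrict pruning to weight matrices."""
        probe = copy.deepcopy(model)
        few_step_finetune(probe, loader, loss_fn)
        scores = torch.cat([p.detach().abs().flatten() for p in probe.parameters()])
        k = max(1, int(sparsity * scores.numel()))  # number of weights to drop
        threshold = scores.kthvalue(k).values
        return {name: (p.detach().abs() > threshold).float()
                for name, p in probe.named_parameters()}

    def instant_soup_mask(model, subset_loaders, loss_fn, sparsity):
        """Superpose several weak masks: sum their votes and re-threshold,
        so the noise of individual masks averages out into one denoised mask."""
        votes = None
        for loader in subset_loaders:
            mask = weak_mask(model, loader, loss_fn, sparsity)
            votes = mask if votes is None else {n: votes[n] + mask[n] for n in votes}
        flat = torch.cat([v.flatten() for v in votes.values()])
        k = max(1, int(sparsity * flat.numel()))
        threshold = flat.kthvalue(k).values
        return {n: (v > threshold).float() for n, v in votes.items()}

The aggregation step is where the "soup" analogy enters: each weak mask is cheap and noisy on its own, but summing their votes and re-thresholding plays a role analogous to averaging fine-tuned weights in model soups, letting the noise cancel out instead of paying for repeated full IMP train-prune cycles.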
Related papers
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST)
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process.
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
arXiv Detail & Related papers (2024-07-30T06:33:44Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
We propose SPP, a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- Efficient Stitchable Task Adaptation [47.94819192325723]
We present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models.
Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches.
We streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy.
arXiv Detail & Related papers (2023-11-29T04:31:35Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z)
- Efficient Stein Variational Inference for Reliable Distribution-lossless Network Pruning [23.22021752821507]
We propose a novel distribution-lossless pruning method that theoretically finds the pruned lottery ticket within a Bayesian treatment.
Our method can obtain sparser networks with great performance while providing quantified reliability for the pruned model.
arXiv Detail & Related papers (2022-12-07T09:31:47Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Dual Lottery Ticket Hypothesis [71.95937879869334]
Lottery Ticket Hypothesis (LTH) provides a novel view to investigate sparse network training and maintain its capacity.
In this work, we regard the winning ticket from LTH as a subnetwork that is trainable, and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our Dual Lottery Ticket Hypothesis (DLTH).
arXiv Detail & Related papers (2022-03-08T18:06:26Z)
- Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm [5.621336109915588]
We show for the first time that sparse pruning compresses a BERT model significantly more than reducing its number of channels and layers.
Our method outperforms the leading competitors with a 20x weight/FLOPs compression and negligible loss in prediction accuracy.
arXiv Detail & Related papers (2021-04-18T02:20:37Z)
- The Elastic Lottery Ticket Hypothesis [106.79387235014379]
The Lottery Ticket Hypothesis raises keen attention to identifying sparse trainable subnetworks, or winning tickets.
The most effective method to identify such winning tickets is still Iterative Magnitude-based Pruning.
We propose a variety of strategies to tweak the winning tickets found from different networks of the same model family.
arXiv Detail & Related papers (2021-03-30T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.