[Reproducibility Report] Rigging the Lottery: Making All Tickets Winners
- URL: http://arxiv.org/abs/2103.15767v2
- Date: Tue, 30 Mar 2021 03:15:56 GMT
- Title: [Reproducibility Report] Rigging the Lottery: Making All Tickets Winners
- Authors: Varun Sundar, Rajat Vadiraj Dwaraknath
- Abstract summary: $\textit{RigL}$, a sparse training algorithm, claims to directly train sparse networks that match or exceed the performance of existing dense-to-sparse training techniques.
We implement $\textit{RigL}$ from scratch in PyTorch and reproduce its performance on CIFAR-10 within 0.1% of the reported value.
- Score: 1.6884611234933766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $\textit{RigL}$, a sparse training algorithm, claims to directly train sparse
networks that match or exceed the performance of existing dense-to-sparse
training techniques (such as pruning) for a fixed parameter count and compute
budget. We implement $\textit{RigL}$ from scratch in PyTorch and reproduce its
performance on CIFAR-10 within 0.1% of the reported value. On both
CIFAR-10/100, the central claim holds -- given a fixed training budget,
$\textit{RigL}$ surpasses existing dynamic-sparse training methods over a range
of target sparsities. By training longer, the performance can match or exceed
iterative pruning, while consuming constant FLOPs throughout training. We also
show that there is little benefit in tuning $\textit{RigL}$'s hyperparameters
for every (sparsity, initialization) pair -- the reference choice of
hyperparameters often achieves near-optimal performance. Going beyond the
original paper, we find that the optimal initialization scheme depends on the
training constraint. While the Erdos-Renyi-Kernel distribution outperforms the
Uniform distribution for a fixed parameter count, for a fixed FLOP count, the
latter performs better. Finally, redistributing layer-wise sparsity while
training can bridge the performance gap between the two initialization schemes,
but increases computational cost.
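Since the report reimplements RigL in PyTorch, a rough sketch of the mask update behind the claims above may help orient readers: every few hundred steps, RigL drops the smallest-magnitude active weights in each layer and regrows the same number of connections where the dense gradient is largest, keeping the parameter count fixed. The function below is a minimal single-layer illustration under our own naming and simplifications (no cosine decay of the drop fraction, no per-layer ERK/Uniform sparsity allocation); it is not the authors' code.

```python
import torch

def rigl_update(weight: torch.Tensor,
                mask: torch.Tensor,
                dense_grad: torch.Tensor,
                drop_fraction: float) -> torch.Tensor:
    """One simplified RigL prune-and-grow step for a single layer.

    Drops the `drop_fraction` of active weights with the smallest magnitude,
    then regrows the same number of inactive connections at the positions
    with the largest dense-gradient magnitude, so the layer's sparsity
    (and hence its parameter count) stays constant.
    """
    n_update = int(drop_fraction * mask.sum().item())
    if n_update == 0:
        return mask

    # Drop: active weights with the smallest magnitude.
    drop_score = torch.where(mask.bool(), weight.abs(),
                             torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(drop_score.flatten(), n_update, largest=False).indices

    # Grow: inactive positions with the largest dense-gradient magnitude.
    grow_score = torch.where(mask.bool(),
                             torch.full_like(weight, float("-inf")),
                             dense_grad.abs())
    grow_idx = torch.topk(grow_score.flatten(), n_update, largest=True).indices

    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    new_mask[grow_idx] = 1.0

    # Newly grown connections start from zero, as described in the RigL paper.
    weight.data.view(-1)[grow_idx] = 0.0  # assumes a contiguous weight tensor
    return new_mask.view_as(mask)
```

In the paper itself, the drop fraction decays with a cosine schedule and the update is applied layer-wise, with per-layer sparsities set by the ERK or Uniform distributions compared in the abstract.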
Related papers
- DRIVE: Dual Gradient-Based Rapid Iterative Pruning [2.209921757303168]
Modern deep neural networks (DNNs) consist of millions of parameters, necessitating high-performance computing during training and inference.
Traditional pruning methods that are applied post-training focus on streamlining inference, but there are recent efforts to leverage sparsity early on by pruning before training.
We present Dual Gradient-Based Rapid Iterative Pruning (DRIVE), which leverages dense training for the initial epochs to counteract the randomness inherent at initialization.
arXiv Detail & Related papers (2024-04-01T20:44:28Z) - RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation [30.797422827190278]
We present a new PEFT method called Robust Adaptation (RoSA) inspired by robust principal component analysis.
RoSA trains $\textit{low-rank}$ and $\textit{highly-sparse}$ components on top of a set of fixed pretrained weights.
We show that RoSA outperforms LoRA, pure sparse fine-tuning, and alternative hybrid methods at the same parameter budget.
arXiv Detail & Related papers (2024-01-09T17:09:01Z) - Towards Understanding and Improving GFlowNet Training [71.85707593318297]
We introduce an efficient evaluation strategy to compare the learned sampling distribution to the target reward distribution.
We propose prioritized replay training of high-reward $x$, relative edge flow policy parametrization, and a novel guided trajectory balance objective.
arXiv Detail & Related papers (2023-05-11T22:50:41Z) - Does Continual Learning Equally Forget All Parameters? [55.431048995662714]
Distribution shift (e.g., task or domain shift) in continual learning (CL) usually results in catastrophic forgetting of neural networks.
We study which modules in neural networks are more prone to forgetting by investigating their training dynamics during CL.
We propose a simpler and more efficient method that entirely removes the every-step replay and replaces it with FPF triggered only $k$ times periodically during CL.
arXiv Detail & Related papers (2023-04-09T04:36:24Z) - Learning a Consensus Sub-Network with Polarization Regularization and
One Pass Training [3.2214522506924093]
Pruning schemes create extra overhead, either through iterative training and fine-tuning (static pruning) or through repeated computation of a dynamic pruning graph.
We propose a new parameter pruning strategy for learning a lighter-weight sub-network that minimizes the energy cost while maintaining comparable performance to the fully parameterised network on given downstream tasks.
Our results on CIFAR-10 and CIFAR-100 suggest that our scheme can remove 50% of connections in deep networks with less than 1% reduction in classification accuracy.
arXiv Detail & Related papers (2023-02-17T09:37:17Z) - Multi-Rate VAE: Train Once, Get the Full Rate-Distortion Curve [29.86440019821837]
Variational autoencoders (VAEs) are powerful tools for learning latent representations of data used in a wide range of applications.
In this paper, we introduce Multi-Rate VAE, a computationally efficient framework for learning optimal parameters corresponding to various $\beta$ in a single training run.
arXiv Detail & Related papers (2022-12-07T19:02:34Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training
with Dynamic Sparsity [74.58777701536668]
We introduce the FreeTickets concept, which can boost the performance of sparse convolutional neural networks over their dense network equivalents by a large margin.
We propose two novel efficient ensemble methods with dynamic sparsity, which yield in one shot many diverse and accurate tickets "for free" during the sparse training process.
arXiv Detail & Related papers (2021-06-28T10:48:20Z) - Chasing Sparsity in Vision Transformers: An End-to-End Exploration [127.10054032751714]
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget.
arXiv Detail & Related papers (2021-06-08T17:18:00Z) - Parameter-Efficient Transfer Learning with Diff Pruning [108.03864629388404]
Diff pruning is a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework.
We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark.
arXiv Detail & Related papers (2020-12-14T12:34:01Z)