Rare Gems: Finding Lottery Tickets at Initialization
- URL: http://arxiv.org/abs/2202.12002v1
- Date: Thu, 24 Feb 2022 10:28:56 GMT
- Title: Rare Gems: Finding Lottery Tickets at Initialization
- Authors: Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot
Nagle, Hongyi Wang, Kangwook Lee, Dimitris Papailiopoulos
- Abstract summary: Large neural networks can be pruned to a small fraction of their original size.
Current algorithms for finding trainable networks fail simple baseline comparisons.
Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem.
- Score: 21.130411799740532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It has been widely observed that large neural networks can be pruned to a
small fraction of their original size, with little loss in accuracy, by
typically following a time-consuming "train, prune, re-train" approach. Frankle
& Carbin (2018) conjecture that we can avoid this by training lottery tickets,
i.e., special sparse subnetworks found at initialization, that can be trained
to high accuracy. However, a subsequent line of work presents concrete evidence
that current algorithms for finding trainable networks at initialization, fail
simple baseline comparisons, e.g., against training random sparse subnetworks.
Finding lottery tickets that train to better accuracy compared to simple
baselines remains an open problem. In this work, we partially resolve this open
problem by discovering rare gems: subnetworks at initialization that attain
considerable accuracy, even before training. Refining these rare gems by means
of fine-tuning beats current baselines and leads to accuracy competitive with
or better than magnitude pruning methods.
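The workflow described in the abstract, finding a sparse mask at initialization whose subnetwork is already accurate and then fine-tuning only the surviving weights, can be summarized with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' mask-search algorithm: `random_mask` merely stands in for whatever procedure produces the mask, while `apply_mask`, `accuracy_before_training`, and `finetune_step` show how the subnetwork is evaluated before training and then refined with its sparsity pattern held fixed.

```python
import torch
import torch.nn as nn


def random_mask(model: nn.Module, sparsity: float) -> dict:
    """Hypothetical stand-in for the mask search: keep a random (1 - sparsity) fraction per layer."""
    return {
        name: (torch.rand_like(p) > sparsity).float()
        for name, p in model.named_parameters()
        if p.dim() > 1  # mask weight matrices only, leave biases dense
    }


def apply_mask(model: nn.Module, mask: dict) -> None:
    """Zero out the pruned weights in place."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in mask:
                p.mul_(mask[name])


@torch.no_grad()
def accuracy_before_training(model, mask, loader) -> float:
    """Accuracy of the masked subnetwork at initialization, before any weight training."""
    apply_mask(model, mask)
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total


def finetune_step(model, mask, batch, optimizer, loss_fn):
    """One fine-tuning step that keeps the pruned weights at zero."""
    x, y = batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    apply_mask(model, mask)  # re-impose the sparsity pattern after the update
    return loss.item()
```

In the paper, the mask itself is the product of a search at initialization; `random_mask` above only marks where that step would go.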
Related papers
- Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z)
- Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks [40.55816472416984]
A striking observation about iterative magnitude pruning (IMP; Frankle et al.) is that it succeeds after just a few hundred steps of dense pre-training.
In this work, we seek to understand how this early phase of pre-training leads to good IMP subnetworks, from the perspective of both the data and the network.
We identify novel properties of the loss landscape of dense networks that are predictive of IMP performance.
arXiv Detail & Related papers (2022-06-02T20:04:06Z)
- Dual Lottery Ticket Hypothesis [71.95937879869334]
The Lottery Ticket Hypothesis (LTH) provides a novel view for investigating sparse network training while maintaining its capacity.
In this work, we regard the winning ticket from LTH as the subnetwork that is in a trainable condition, and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our Dual Lottery Ticket Hypothesis (DLTH).
arXiv Detail & Related papers (2022-03-08T18:06:26Z)
- The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training [111.15069968583042]
Random pruning is arguably the most naive way to attain sparsity in neural networks, but it has been deemed uncompetitive with either post-training pruning or sparse training.
We empirically demonstrate that sparsely training a randomly pruned network from scratch can match the performance of its dense equivalent.
Our results strongly suggest there is larger-than-expected room for sparse training at scale, and the benefits of sparsity might be more universal beyond carefully designed pruning.
arXiv Detail & Related papers (2022-02-05T21:19:41Z)
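As a rough illustration of the random-pruning baseline in the entry above, the sketch below prunes every linear layer uniformly at random at initialization and then relies on an ordinary training loop. The 90% sparsity level and the small MLP are arbitrary illustrative choices, not the paper's experimental setup.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),
    nn.Linear(300, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# Remove 90% of the weights in every linear layer, uniformly at random.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.random_unstructured(module, name="weight", amount=0.9)

# The pruning reparametrization (weight = weight_orig * weight_mask) zeroes
# the gradient of pruned entries, so a standard training loop on `model`
# trains the random subnetwork from scratch with the mask held fixed.
```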
- Plant 'n' Seek: Can You Find the Winning Ticket? [6.85316573653194]
The lottery ticket hypothesis has sparked the rapid development of pruning algorithms that perform structure learning.
We hand-craft extremely sparse network topologies, plant them in large neural networks, and evaluate state-of-the-art lottery ticket pruning methods.
arXiv Detail & Related papers (2021-11-22T12:32:25Z)
- Towards Understanding Iterative Magnitude Pruning: Why Lottery Tickets Win [20.97456178983006]
The lottery ticket hypothesis states that sparse subnetworks exist in randomly initialized dense networks that can be trained to the same accuracy as the dense network they reside in.
We show that by using a training method that is stable with respect to linear mode connectivity, large networks can also be entirely rewound to initialization.
arXiv Detail & Related papers (2021-06-13T10:06:06Z)
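For reference, iterative magnitude pruning with rewinding, the procedure analyzed in the entry above, can be outlined as follows. This is a generic sketch under stated assumptions: `train` is an assumed helper that trains (or fine-tunes) the model while respecting a mask, and the keep fraction per round is arbitrary; none of this is the authors' code.

```python
import copy
import torch


def magnitude_mask(model, keep_frac, old_mask=None):
    """Keep the largest-magnitude fraction of the currently surviving weights."""
    mask = {}
    for name, p in model.named_parameters():
        if p.dim() <= 1:  # prune weight matrices only, leave biases dense
            continue
        w = p.detach().abs()
        if old_mask is not None:
            w = w * old_mask[name]  # already-pruned weights stay pruned
        n_alive = int(old_mask[name].sum()) if old_mask is not None else w.numel()
        k = max(1, int(keep_frac * n_alive))
        threshold = torch.topk(w.flatten(), k).values.min()
        mask[name] = (w >= threshold).float()
    return mask


def iterative_magnitude_pruning(model, train, rounds=5, keep_per_round=0.8):
    """Alternate training, magnitude pruning, and rewinding to the saved weights."""
    rewind_state = copy.deepcopy(model.state_dict())  # weights to rewind to
    mask = None
    for _ in range(rounds):
        train(model, mask)                            # assumed helper: trains under the mask
        mask = magnitude_mask(model, keep_per_round, mask)
        model.load_state_dict(rewind_state)           # rewind surviving weights
    return mask
```

Rewinding to an early training checkpoint rather than the exact initialization is a common variant; either way, the returned mask defines the ticket that is subsequently retrained.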
- Lottery Ticket Implies Accuracy Degradation, Is It a Desirable Phenomenon? [43.47794674403988]
In deep model compression, the recent Lottery Ticket Hypothesis (LTH) (Frankle & Carbin) pointed out that there could exist a winning ticket.
We investigate the underlying condition and rationale behind the winning property, and find that the underlying reason is largely attributed to the correlation between initialized weights and final-trained weights.
We propose the "pruning & fine-tuning" method that consistently outperforms lottery ticket sparse training.
arXiv Detail & Related papers (2021-02-19T14:49:46Z)
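To make the comparison in the entry above concrete, the sketch below contrasts the two retraining strategies: pruning & fine-tuning keeps the trained values of the surviving weights, whereas lottery-ticket retraining rewinds them to their initial values. It assumes the `magnitude_mask` and `apply_mask` helpers from the earlier sketches are in scope and that `train(model, mask)` trains under the given mask; this is an illustrative outline, not the paper's own method or code.

```python
import copy


def prune_and_finetune(model, train, keep_frac=0.2):
    """Train dense, prune once by magnitude, then fine-tune the kept trained weights."""
    train(model, None)                           # 1) train the dense network
    mask = magnitude_mask(model, keep_frac)      # 2) one-shot magnitude pruning
    apply_mask(model, mask)
    train(model, mask)                           # 3) fine-tune from the trained weights
    return model, mask


def lottery_ticket_retrain(model, train, keep_frac=0.2):
    """Same pruning step, but the surviving weights are rewound to initialization."""
    init_state = copy.deepcopy(model.state_dict())
    train(model, None)
    mask = magnitude_mask(model, keep_frac)
    model.load_state_dict(init_state)            # rewind to the initial weights
    apply_mask(model, mask)
    train(model, mask)                           # retrain the ticket from initialization
    return model, mask
```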
- Good Students Play Big Lottery Better [84.6111281091602]
The lottery ticket hypothesis suggests that a dense neural network contains a sparse sub-network that can match the test accuracy of the original dense net.
Recent studies demonstrate that a sparse sub-network can still be obtained by using a rewinding technique.
This paper proposes a new, simpler and yet powerful technique for re-training the sub-network, called the "Knowledge Distillation ticket" (KD ticket).
arXiv Detail & Related papers (2021-01-08T23:33:53Z)
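The KD ticket idea above, retraining the sparse subnetwork while distilling from the trained dense network, can be illustrated with a standard distillation loss. This is a generic sketch rather than the paper's exact recipe; the temperature, the mixing weight, and the assumption that the student's pruning mask is enforced elsewhere (e.g. by re-applying it after each step) are all illustrative choices.

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend cross-entropy on hard labels with KL divergence to the softened teacher outputs."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def distill_step(student, teacher, batch, optimizer):
    """One retraining step of the sparse student against the frozen dense teacher."""
    x, y = batch
    with torch.no_grad():
        t_logits = teacher(x)          # dense teacher, not updated
    optimizer.zero_grad()
    loss = kd_loss(student(x), t_logits, y)
    loss.backward()
    optimizer.step()                   # the student's pruning mask is assumed
    return loss.item()                 # to be enforced outside this function
```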
- The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
- Distance-Based Regularisation of Deep Networks for Fine-Tuning [116.71288796019809]
We develop an algorithm that constrains a hypothesis class to a small sphere centred on the initial pre-trained weights.
Empirical evaluation shows that our algorithm works well, corroborating our theoretical results.
arXiv Detail & Related papers (2020-02-19T16:00:47Z)
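The constraint described in the last entry, keeping the fine-tuned hypothesis inside a small sphere centred on the pre-trained weights, is commonly enforced with a projection step after each update. The sketch below is one such generic projection, offered as an illustration rather than the paper's exact algorithm; `pretrained_state` is assumed to be a copy of the model's state dict taken before fine-tuning, and `radius` is a hyperparameter.

```python
import torch


@torch.no_grad()
def project_to_ball(model, pretrained_state, radius):
    """Project (w - w_pretrained) onto a Euclidean ball of the given radius."""
    # Distance of the current weights from the pre-trained weights, over all parameters.
    sq_dist = 0.0
    for name, p in model.named_parameters():
        sq_dist += (p - pretrained_state[name]).pow(2).sum()
    dist = sq_dist.sqrt()
    if dist > radius:
        scale = radius / dist
        for name, p in model.named_parameters():
            p.copy_(pretrained_state[name] + scale * (p - pretrained_state[name]))
```

A typical fine-tuning loop would call `project_to_ball(model, pretrained_state, radius)` immediately after every `optimizer.step()`.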
This list is automatically generated from the titles and abstracts of the papers on this site.