Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning
Ticket's Mask?
- URL: http://arxiv.org/abs/2210.03044v1
- Date: Thu, 6 Oct 2022 16:50:20 GMT
- Title: Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning
Ticket's Mask?
- Authors: Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya
Ganguli, Gintare Karolina Dziugaite
- Abstract summary: We show that an IMP mask found at the end of training conveys the identity of a desired subspace.
We also show that SGD can exploit this information due to a strong form of robustness.
Overall, our results make progress toward demystifying the existence of winning tickets.
- Score: 40.52143582292875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep learning involves training costly, highly overparameterized
networks, thus motivating the search for sparser networks that can still be
trained to the same accuracy as the full network (i.e. matching). Iterative
magnitude pruning (IMP) is a state of the art algorithm that can find such
highly sparse matching subnetworks, known as winning tickets. IMP operates by
iterative cycles of training, masking smallest magnitude weights, rewinding
back to an early training point, and repeating. Despite its simplicity, the
underlying principles for when and how IMP finds winning tickets remain
elusive. In particular, what useful information does an IMP mask found at the
end of training convey to a rewound network near the beginning of training? How
does SGD allow the network to extract this information? And why is iterative
pruning needed? We develop answers in terms of the geometry of the error
landscape. First, we find that, at higher sparsities, pairs of pruned
networks at successive pruning
iterations are connected by a linear path with zero error barrier if and only
if they are matching. This indicates that masks found at the end of training
convey the identity of an axial subspace that intersects a desired linearly
connected mode of a matching sublevel set. Second, we show SGD can exploit this
information due to a strong form of robustness: it can return to this mode
despite strong perturbations early in training. Third, we show how the flatness
of the error landscape at the end of training determines a limit on the
fraction of weights that can be pruned at each iteration of IMP. Finally, we
show that the role of retraining in IMP is to find a network with new small
weights to prune. Overall, these results make progress toward demystifying the
existence of winning tickets by revealing the fundamental role of error
landscape geometry.
Related papers
- No Free Prune: Information-Theoretic Barriers to Pruning at Initialization [8.125999058340998]
We show the Law of Robustness of arXiv:2105.12806 extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$.
Experiments on neural networks confirm that information gained during training may indeed affect model capacity.
arXiv Detail & Related papers (2024-02-02T01:13:16Z) - When Layers Play the Lottery, all Tickets Win at Initialization [0.0]
Pruning is a technique for reducing the computational cost of deep networks.
In this work, we propose to discover winning tickets when the pruning process removes layers.
Our winning tickets notably speed up the training phase and reduce carbon emissions by up to 51%.
arXiv Detail & Related papers (2023-01-25T21:21:15Z) - Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z) - Lottery Tickets on a Data Diet: Finding Initializations with Sparse
Trainable Networks [40.55816472416984]
A striking observation about iterative magnitude pruning (IMP; Frankle et al.) is that, after just a few hundred steps of dense training, the method can find a sparse subnetwork that trains to the same accuracy as the dense network.
In this work, we seek to understand how this early phase of pre-training leads to good IMP performance, through the lens of both the data and the network.
We identify novel properties of the loss landscape of dense networks that are predictive of performance.
arXiv Detail & Related papers (2022-06-02T20:04:06Z) - Dual Lottery Ticket Hypothesis [71.95937879869334]
The Lottery Ticket Hypothesis (LTH) provides a novel view for investigating sparse network training while maintaining model capacity.
In this work, we regard the winning ticket from LTH as a subnetwork in a trainable condition and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our DLTH.
arXiv Detail & Related papers (2022-03-08T18:06:26Z) - Coarsening the Granularity: Towards Structurally Sparse Lottery Tickets [127.56361320894861]
The lottery ticket hypothesis (LTH) has shown that dense models contain highly sparse subnetworks (i.e., winning tickets) that can be trained in isolation to match full accuracy.
In this paper, we demonstrate the first positive result that a structurally sparse winning ticket can be effectively found in general.
Specifically, we first "re-fill" pruned elements back in some channels deemed to be important, and then "re-group" non-zero elements to create flexible group-wise structural patterns.
arXiv Detail & Related papers (2022-02-09T21:33:51Z) - The Elastic Lottery Ticket Hypothesis [106.79387235014379]
The Lottery Ticket Hypothesis has drawn keen attention to identifying sparse trainable subnetworks, or winning tickets.
The most effective method to identify such winning tickets is still Iterative Magnitude-based Pruning.
We propose a variety of strategies to tweak the winning tickets found from different networks of the same model family.
arXiv Detail & Related papers (2021-03-30T17:53:45Z) - Good Students Play Big Lottery Better [84.6111281091602]
Lottery ticket hypothesis suggests that a dense neural network contains a sparse sub-network that can match the test accuracy of the original dense net.
Recent studies demonstrate that a sparse sub-network can still be obtained by using a rewinding technique.
This paper proposes a new, simpler, and yet powerful technique for re-training the sub-network, called the "Knowledge Distillation ticket" (KD ticket).
arXiv Detail & Related papers (2021-01-08T23:33:53Z)