Lottery Tickets on a Data Diet: Finding Initializations with Sparse
Trainable Networks
- URL: http://arxiv.org/abs/2206.01278v1
- Date: Thu, 2 Jun 2022 20:04:06 GMT
- Title: Lottery Tickets on a Data Diet: Finding Initializations with Sparse
Trainable Networks
- Authors: Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle,
Gintare Karolina Dziugaite
- Abstract summary: A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that trains to the same accuracy as the dense network.
In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP, both through the lens of the data distribution and the loss landscape geometry.
We identify novel properties of the loss landscape of dense networks that are predictive of IMP performance.
- Score: 40.55816472416984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A striking observation about iterative magnitude pruning (IMP; Frankle et al.
2020) is that, after just a few hundred steps of dense training, the method
can find a sparse sub-network that can be
trained to the same accuracy as the dense network. However, the same does not
hold at step 0, i.e. random initialization. In this work, we seek to understand
how this early phase of pre-training leads to a good initialization for IMP
both through the lens of the data distribution and the loss landscape geometry.
Empirically we observe that, holding the number of pre-training iterations
constant, training on a small fraction of (randomly chosen) data suffices to
obtain an equally good initialization for IMP. We additionally observe that by
pre-training only on "easy" training data, we can decrease the number of steps
necessary to find a good initialization for IMP compared to training on the
full dataset or a randomly chosen subset. Finally, we identify novel properties
of the loss landscape of dense networks that are predictive of IMP performance,
showing in particular that more examples being linearly mode connected in the
dense network correlates well with good initializations for IMP. Combined,
these results provide new insight into the role played by the early phase
training in IMP.
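
Since the abstract's central object is IMP with weight rewinding to an early pre-training checkpoint, a minimal sketch may help make the loop concrete. The PyTorch-style code below assumes a hypothetical helper `train(model, masks, steps)` that runs masked SGD, and a 20% per-round pruning fraction; it illustrates the general procedure, not the authors' implementation.

```python
# Minimal sketch of IMP with weight rewinding to an early pre-training checkpoint.
# Assumption (not from the paper's code): `train(model, masks, steps)` runs masked SGD.
import copy
import torch

def make_masks(model):
    # One binary mask per prunable tensor; here, every weight matrix/kernel.
    return {name: torch.ones_like(p, dtype=torch.bool)
            for name, p in model.named_parameters() if p.dim() > 1}

def prune_by_magnitude(model, masks, frac=0.2):
    # Globally remove the smallest-magnitude `frac` of currently surviving weights.
    surviving = torch.cat([p.detach()[masks[n]].abs().flatten()
                           for n, p in model.named_parameters() if n in masks])
    k = max(1, int(frac * surviving.numel()))
    threshold = surviving.kthvalue(k).values
    for n, p in model.named_parameters():
        if n in masks:
            masks[n] &= p.detach().abs() > threshold

def rewind(model, checkpoint, masks):
    # Reset all parameters to the early checkpoint; pruned weights stay zeroed.
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.copy_(checkpoint[n] * masks[n])
            elif n in checkpoint:
                p.copy_(checkpoint[n])

def imp_with_rewinding(model, pretrain_steps, train_steps, rounds):
    masks = make_masks(model)
    train(model, masks, pretrain_steps)              # short dense pre-training phase
    checkpoint = copy.deepcopy(model.state_dict())   # the "good initialization" studied here
    for _ in range(rounds):
        train(model, masks, train_steps)             # train to completion (masked)
        prune_by_magnitude(model, masks)
        rewind(model, checkpoint, masks)             # rewind survivors to the checkpoint
    return masks, checkpoint
```

Rewinding to the early checkpoint rather than to step 0 is exactly the distinction the abstract draws: at random initialization the same procedure does not yield sparse sub-networks that match dense accuracy.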
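The data-diet finding, that pre-training on a small random or "easy" subset suffices, amounts to choosing a subset by a per-example difficulty score before the pre-training phase. The sketch below uses the loss of a briefly trained probe network as an illustrative proxy for difficulty, since the abstract does not state how "easy" examples are scored.

```python
# Sketch of selecting an "easy" subset for the pre-training phase.
# The difficulty score (loss under a briefly trained probe network) is an
# illustrative proxy only; the abstract does not specify the scoring method.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

def example_difficulty(probe_model, dataset, batch_size=256, device="cpu"):
    # Higher per-example loss => harder example, under the probe model.
    probe_model.eval().to(device)
    losses = []
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    with torch.no_grad():
        for x, y in loader:
            logits = probe_model(x.to(device))
            losses.append(F.cross_entropy(logits, y.to(device), reduction="none").cpu())
    return torch.cat(losses)

def easy_subset(dataset, scores, fraction=0.1):
    # Keep the lowest-scoring (easiest) fraction of examples for pre-training.
    k = int(fraction * len(dataset))
    idx = torch.argsort(scores)[:k]
    return Subset(dataset, idx.tolist())

# Usage sketch: pre-train only on the easy subset, then hand the checkpoint to IMP.
# easy_data = easy_subset(train_data, example_difficulty(probe, train_data), fraction=0.1)
```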
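The loss-landscape property highlighted at the end of the abstract, the fraction of examples that are linearly mode connected, can be estimated by training two copies from the same checkpoint with different SGD noise and checking, per example, whether predictions stay correct along the straight line between the two solutions. The helper below is a hedged sketch of one such measurement, not the paper's exact protocol.

```python
# Sketch: per-example linear mode connectivity between two networks trained from the
# same checkpoint (e.g., with different data orders). An example counts as connected
# here if it is classified correctly at every point on the interpolation path; this
# criterion is illustrative, not the paper's exact one.
import copy
import torch
from torch.utils.data import DataLoader

def interpolate_state(state_a, state_b, alpha):
    # Convex combination of two state dicts; non-float entries (e.g. counters) copied as-is.
    return {k: ((1 - alpha) * v + alpha * state_b[k]) if v.is_floating_point() else v
            for k, v in state_a.items()}

@torch.no_grad()
def fraction_linearly_connected(model, state_a, state_b, dataset, num_alphas=11, device="cpu"):
    loader = DataLoader(dataset, batch_size=512, shuffle=False)
    probe = copy.deepcopy(model).to(device).eval()
    always_correct = None
    for alpha in torch.linspace(0, 1, num_alphas):
        probe.load_state_dict(interpolate_state(state_a, state_b, alpha.item()))
        correct = []
        for x, y in loader:
            preds = probe(x.to(device)).argmax(dim=1).cpu()
            correct.append(preds == y)
        correct = torch.cat(correct)
        always_correct = correct if always_correct is None else always_correct & correct
    # Fraction of examples that stay correct along the whole interpolation path.
    return always_correct.float().mean().item()
```

Under this criterion, a higher fraction for the early pre-training checkpoint than for the random initialization would mirror the correlation the abstract reports.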
Related papers
- Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning [14.792099973449794]
We propose an algorithm to align the training dynamics of the sparse network with that of the dense one.
We show how the usually neglected data-dependent component in the NTK's spectrum can be taken into account.
Path eXclusion (PX) is able to find lottery tickets even at high sparsity levels.
arXiv Detail & Related papers (2024-06-03T22:19:42Z) - Sparser, Better, Deeper, Stronger: Improving Sparse Training with Exact Orthogonal Initialization [49.06421851486415]
Static sparse training aims to train sparse models from scratch, achieving remarkable results in recent years.
We propose Exact Orthogonal Initialization (EOI), a novel sparse Orthogonal Initialization scheme based on random Givens rotations.
Our method enables training highly sparse 1000-layer MLP and CNN networks without residual connections or normalization techniques; a Givens-rotation sketch of the idea appears after this list.
arXiv Detail & Related papers (2024-06-03T19:44:47Z) - Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data [42.870635753205185]
Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors.
We show that both benign overfitting and grokking provably occur in two-layer ReLU networks trained by GD on XOR cluster data.
At a later training step, the network achieves near-optimal test accuracy while still fitting the random labels in the training data, exhibiting a "grokking" phenomenon.
arXiv Detail & Related papers (2023-10-04T02:50:34Z) - Unsupervised Learning of Initialization in Deep Neural Networks via
Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find a good initialization for given input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - SEPT: Towards Scalable and Efficient Visual Pre-Training [11.345844145289524]
Self-supervised pre-training has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance.
We build a task-specific self-supervised pre-training framework based on a simple hypothesis that pre-training on the unlabeled samples with similar distribution to the target task can bring substantial performance gains.
arXiv Detail & Related papers (2022-12-11T11:02:11Z) - Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning
Ticket's Mask? [40.52143582292875]
We show that an IMP mask found at the end of training conveys the identity of a desired subspace.
We also show that SGD can exploit this information due to a strong form of robustness.
Overall, our results make progress toward demystifying the existence of winning tickets.
arXiv Detail & Related papers (2022-10-06T16:50:20Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning and large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via a Polyak-Lojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - Dense for the Price of Sparse: Improved Performance of Sparsely
Initialized Networks via a Subspace Offset [0.0]
We introduce a new 'DCT plus Sparse' layer architecture, which maintains information propagation and trainability even with as little as 0.01% trainable kernel parameters remaining.
Switching from standard sparse layers to DCT plus Sparse layers does not increase the storage footprint of a network and incurs only a small additional computational overhead; a minimal sketch of such a layer appears after this list.
arXiv Detail & Related papers (2021-02-12T00:05:02Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.