Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization
is Sufficient
- URL: http://arxiv.org/abs/2006.07990v2
- Date: Thu, 11 Mar 2021 18:40:46 GMT
- Title: Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization
is Sufficient
- Authors: Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma and
Dimitris Papailiopoulos
- Abstract summary: We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(log(dl))$ wider and twice as deep.
- Our analysis relies on connecting pruning random ReLU networks to random instances of the \textsc{SubsetSum} problem.
- Score: 9.309655246559094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The strong {\it lottery ticket hypothesis} (LTH) postulates that one can
approximate any target neural network by only pruning the weights of a
sufficiently over-parameterized random network. A recent work by Malach et al.
\cite{MalachEtAl20} establishes the first theoretical analysis for the strong
LTH: one can provably approximate a neural network of width $d$ and depth $l$,
by pruning a random one that is a factor $O(d^4l^2)$ wider and twice as deep.
This polynomial over-parameterization requirement is at odds with recent
experimental research that achieves good approximation with networks that are a
small factor wider than the target. In this work, we close the gap and offer an
exponential improvement to the over-parameterization requirement for the
existence of lottery tickets. We show that any target network of width $d$ and
depth $l$ can be approximated by pruning a random network that is a factor
$O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on
connecting pruning random ReLU networks to random instances of the
\textsc{SubsetSum} problem. We then show that this logarithmic
over-parameterization is essentially optimal for constant depth networks.
Finally, we verify several of our theoretical insights with experiments.
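The key primitive behind the $O(\log(dl))$ factor is a random \textsc{SubsetSum} fact: with high probability, $O(\log(1/\epsilon))$ i.i.d. uniform weights already contain a subset whose sum is within $\epsilon$ of any target in $[-1,1]$, so a single target weight can be recovered purely by pruning. The Python sketch below illustrates only this primitive; the function name, the constant offset on $n$, and the brute-force search are illustrative assumptions, not the paper's construction.

import itertools
import numpy as np

def best_subset_sum(target, samples):
    """Brute-force search over all subsets of `samples` for the sum closest to `target`.
    Fine for the ~15 samples used here; the paper's argument is probabilistic, not algorithmic."""
    best_idx, best_err = (), abs(target)  # the empty subset sums to 0
    for r in range(1, len(samples) + 1):
        for idx in itertools.combinations(range(len(samples)), r):
            err = abs(target - samples[list(idx)].sum())
            if err < best_err:
                best_idx, best_err = idx, err
    return best_idx, best_err

rng = np.random.default_rng(0)
eps = 0.01                                  # desired per-weight accuracy
n = int(np.ceil(np.log2(1.0 / eps))) + 8    # logarithmically many random weights (constant offset is an assumption)
samples = rng.uniform(-1.0, 1.0, size=n)    # stand-in for the paper's products of random weights
target = 0.37                               # a target weight in [-1, 1]
idx, err = best_subset_sum(target, samples)
print(f"kept {len(idx)} of {n} random weights, approximation error {err:.1e}")

In the paper's construction each candidate value is a product of two random weights routed through a pair of ReLUs, and pruning selects the subset; the sketch replaces those products with plain uniform samples to keep the example self-contained.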
Related papers
- On the Sparsity of the Strong Lottery Ticket Hypothesis [8.47014750905382]
Research efforts have recently been made to show that a random neural network $N$ contains subnetworks capable of accurately approximating any given neural network.
We provide the first proof of the Strong Lottery Ticket Hypothesis in classical settings, with guarantees on the sparsity of the subnetworks.
arXiv Detail & Related papers (2024-10-18T06:57:37Z) - Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent [83.85536329832722]
We show that stochastic gradient descent (SGD) can efficiently solve the $k$-parity problem on a $d$-dimensional hypercube.
We then demonstrate how a neural network trained with SGD solves the $k$-parity problem with small statistical error.
arXiv Detail & Related papers (2024-04-18T17:57:53Z) - Understanding Deep Neural Function Approximation in Reinforcement
Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL)
We focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed by Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it into a generalization problem under the non-iid setting.
arXiv Detail & Related papers (2022-09-15T15:42:47Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Most Activation Functions Can Win the Lottery Without Excessive Depth [6.68999512375737]
The lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning.
For networks with ReLU activation functions, it has been proven that a target network with depth $L$ can be approximated by a subnetwork of a random neural network that has double the target's depth $2L$ and is wider by a logarithmic factor.
arXiv Detail & Related papers (2022-05-04T20:51:30Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural networks with $p$ parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic
Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper bounds can be extended to two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - An $L^2$ Analysis of Reinforcement Learning in High Dimensions with
Kernel and Neural Network Approximation [9.088303226909277]
This paper considers the situation where the function approximation is made using the kernel method or the two-layer neural network model.
We establish an $\tilde{O}(H^3|\mathcal{A}|^{\frac{1}{4}}n^{-\frac{1}{4}})$ bound for the optimal policy with $Hn$ samples.
Even though this result still requires a finite-sized action space, the error bound is independent of the dimensionality of the state space.
arXiv Detail & Related papers (2021-04-15T21:59:03Z) - A Universal Approximation Theorem of Deep Neural Networks for Expressing
Probability Distributions [12.100913944042972]
We prove that there exists a deep neural network $g:\mathbb{R}^d\rightarrow\mathbb{R}$ with ReLU activation.
The size of neural network can grow exponentially in $d$ when $1$-Wasserstein distance is used as the discrepancy.
arXiv Detail & Related papers (2020-04-19T14:45:47Z)