Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization
is Sufficient
- URL: http://arxiv.org/abs/2006.07990v2
- Date: Thu, 11 Mar 2021 18:40:46 GMT
- Title: Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization
is Sufficient
- Authors: Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma and
Dimitris Papailiopoulos
- Abstract summary: We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(log(dl))$ wider and twice as deep.
- Our analysis relies on connecting pruning random ReLU networks to random instances of the \textsc{SubsetSum} problem.
- Score: 9.309655246559094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The strong {\it lottery ticket hypothesis} (LTH) postulates that one can
approximate any target neural network by only pruning the weights of a
sufficiently over-parameterized random network. A recent work by Malach et al.
\cite{MalachEtAl20} establishes the first theoretical analysis for the strong
LTH: one can provably approximate a neural network of width $d$ and depth $l$,
by pruning a random one that is a factor $O(d^4l^2)$ wider and twice as deep.
This polynomial over-parameterization requirement is at odds with recent
experimental research that achieves good approximation with networks that are a
small factor wider than the target. In this work, we close the gap and offer an
exponential improvement to the over-parameterization requirement for the
existence of lottery tickets. We show that any target network of width $d$ and
depth $l$ can be approximated by pruning a random network that is a factor
$O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on
connecting pruning random ReLU networks to random instances of the
\textsc{SubsetSum} problem. We then show that this logarithmic
over-parameterization is essentially optimal for constant depth networks.
Finally, we verify several of our theoretical insights with experiments.
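The key primitive behind the $O(\log(dl))$ factor is a random \textsc{SubsetSum} fact: with high probability, $O(\log(1/\epsilon))$ i.i.d. uniform weights already contain a subset whose sum is within $\epsilon$ of any target in $[-1,1]$, so a single target weight can be recovered purely by pruning. The Python sketch below illustrates only this primitive; the function name, the constant offset on $n$, and the brute-force search are illustrative assumptions, not the paper's construction.

import itertools
import numpy as np

def best_subset_sum(target, samples):
    """Brute-force search over all subsets of `samples` for the sum closest to `target`.
    Fine for the ~15 samples used here; the paper's argument is probabilistic, not algorithmic."""
    best_idx, best_err = (), abs(target)  # the empty subset sums to 0
    for r in range(1, len(samples) + 1):
        for idx in itertools.combinations(range(len(samples)), r):
            err = abs(target - samples[list(idx)].sum())
            if err < best_err:
                best_idx, best_err = idx, err
    return best_idx, best_err

rng = np.random.default_rng(0)
eps = 0.01                                  # desired per-weight accuracy
n = int(np.ceil(np.log2(1.0 / eps))) + 8    # logarithmically many random weights (constant offset is an assumption)
samples = rng.uniform(-1.0, 1.0, size=n)    # stand-in for the paper's products of random weights
target = 0.37                               # a target weight in [-1, 1]
idx, err = best_subset_sum(target, samples)
print(f"kept {len(idx)} of {n} random weights, approximation error {err:.1e}")

In the paper's construction each candidate value is a product of two random weights routed through a pair of ReLUs, and pruning selects the subset; the sketch replaces those products with plain uniform samples to keep the example self-contained.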
Related papers
- On the Sparsity of the Strong Lottery Ticket Hypothesis [8.47014750905382]
Research efforts have recently been made to show that a random neural network $N$ contains subnetworks capable of accurately approximating any given neural network.
We provide the first proof of the Strong Lottery Ticket Hypothesis in classical settings, with guarantees on the sparsity of the subnetworks.
arXiv Detail & Related papers (2024-10-18T06:57:37Z) - Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent [83.85536329832722]
We show that stochastic gradient descent (SGD) can efficiently solve the $k$-parity problem on a $d$-dimensional hypercube.
We then demonstrate how a neural network trained with SGD solves the $k$-parity problem with small statistical error.
arXiv Detail & Related papers (2024-04-18T17:57:53Z) - Understanding Deep Neural Function Approximation in Reinforcement
Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL)
We focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed by Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it into a generalization problem under the non-iid setting.
arXiv Detail & Related papers (2022-09-15T15:42:47Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Most Activation Functions Can Win the Lottery Without Excessive Depth [6.68999512375737]
The lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning.
For networks with ReLU activation functions, it has been proven that a target network with depth $L$ can be approximated by a subnetwork of a random neural network that has double the target's depth $2L$ and is wider by a logarithmic factor.
arXiv Detail & Related papers (2022-05-04T20:51:30Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $\Omega(\sqrt{n/p})$ on the robustness of interpolating neural networks with $p$ parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=\mathrm{poly}(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=\exp(\omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic
Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper bounds can be extended to two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - An $L^2$ Analysis of Reinforcement Learning in High Dimensions with
Kernel and Neural Network Approximation [9.088303226909277]
This paper considers the situation where the function approximation is made using the kernel method or the two-layer neural network model.
We establish an $\tilde{O}(H^3|\mathcal{A}|^{\frac{1}{4}}n^{-\frac{1}{4}})$ bound for the optimal policy with $Hn$ samples.
Even though this result still requires a finite-sized action space, the error bound is independent of the dimensionality of the state space.
arXiv Detail & Related papers (2021-04-15T21:59:03Z) - A Universal Approximation Theorem of Deep Neural Networks for Expressing
Probability Distributions [12.100913944042972]
We prove that there exists a deep neural network $g:\mathbb{R}^d\rightarrow\mathbb{R}$ with ReLU activation.
The size of neural network can grow exponentially in $d$ when $1$-Wasserstein distance is used as the discrepancy.
arXiv Detail & Related papers (2020-04-19T14:45:47Z)