Related papers: Pruning is Optimal for Learning Sparse Features in High-Dimensions

Pruning is Optimal for Learning Sparse Features in High-Dimensions

URL: http://arxiv.org/abs/2406.08658v1
Date: Wed, 12 Jun 2024 21:43:12 GMT
Title: Pruning is Optimal for Learning Sparse Features in High-Dimensions
Authors: Nuri Mert Vural, Murat A. Erdogdu,
Abstract summary: We show that a class of statistical models can be optimally learned using pruned neural networks trained with gradient descent. We show that pruning neural networks proportional to the sparsity level of $boldsymbolV$ improves their sample complexity compared to unpruned networks.
Score: 15.967123173054535
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions. We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$, where $\sigma^*$ is a degree-$p$ polynomial, and $\boldsymbol{V} \in \mathbbm{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks. Furthermore, we establish Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound. In the same scenario, however, our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.

Related papers

Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws [21.18373933718468]
We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime.<n>We present a sharp analysis of the dynamics in the feature learning regime, for both the population limit and the finite-sample discretization.
arXiv Detail & Related papers (2025-08-05T17:57:56Z)
The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models [71.5283441529015]
In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace.<n>We introduce the generative leap exponent $kstar$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting.
arXiv Detail & Related papers (2025-06-05T18:34:56Z)
Learning Networks from Wide-Sense Stationary Stochastic Processes [7.59499154221528]
A key inference problem here is to learn edge connectivity from node outputs (potentials) We use a Whittle's maximum likelihood estimator (MLE) to learn the support of $Last$ from temporally correlated samples. We show that the MLE problem is strictly convex, admitting a unique solution.
arXiv Detail & Related papers (2024-12-04T23:14:00Z)
Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*:mathbbRdtomathbbR$ with additive structure. We prove that a large subset of $f_*$ can be efficiently learned by gradient training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z)
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbolx) = textstylesigma_*left(langleboldsymbolx,boldsymbolthetarangleright)$ under isotropic Gaussian data. We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n asymp T asymp C(q) cdot d
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent [83.85536329832722]
We solve the $k$-sparse parity problem with sign gradient descent (SGD) on two-layer fully-connected neural networks. We show that this approach can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube. We then demonstrate how a trained neural network with sign SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors.
arXiv Detail & Related papers (2024-04-18T17:57:53Z)
Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure. We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $Omega(dkstar/2)$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks. For a large subclass of degree $k$s $p$, a three-layer neural network trained via layerwise gradientp descent on the square loss learns the target $h$ up to vanishing test error. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z)
Beyond NTK with Vanilla Gradient Descent: A Mean-Field Analysis of Neural Networks with Polynomial Width, Samples, and Time [37.73689342377357]
It is still an open question whether gradient descent on networks without unnatural modifications can achieve better sample complexity than kernel methods. We show that projected gradient descent with a positive learning number converges to low error with the same sample.
arXiv Detail & Related papers (2023-06-28T16:45:38Z)
Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures. This paper introduces a relaxed assumption that input data are concentrated around a subset of $mathbbRd$ denoted by $mathcalS$, and the intrinsic dimension $mathcalS$ can be characterized by a new complexity notation -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
Generalization and Stability of Interpolating Neural Networks with Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural-networks trained by gradient in the interpolating regime. We prove the training loss number minimizations $m=Omega(log4 (n))$ neurons and neurons $Tapprox n$. With $m=Omega(log4 (n))$ neurons and $Tapprox n$, we bound the test loss training by $tildeO (1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z)
Neural Networks Efficiently Learn Low-Dimensional Representations with SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(langleboldsymbolu,boldsymbolxrangle) + epsilon$ by recovering the principal direction. We also provide compress guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z)
High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $boldsymbolW$ in a two-layer network. Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z)
Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $boldsymbolmu in mathbbRn$, where each component $mu_i$ can be sampled via a noisy oracle. We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-delta$ returns an $epsilon$ accurate estimate of $f(boldsymbolmu)$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z)
The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime [11.252856459394854]
Modern machine learning classifiers often exhibit vanishing classification error on the training set. Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data.
arXiv Detail & Related papers (2019-11-05T00:15:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.