Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing
- URL: http://arxiv.org/abs/2303.11453v2
- Date: Sun, 4 Jun 2023 09:19:32 GMT
- Title: Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing
- Authors: Nived Rajaraman, Devvrit, Aryan Mokhtari, Kannan Ramchandran
- Abstract summary: Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters.
Running gradient descent in the absence of regularization results in models which are not suitable for greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable to that of the maximum.
Our results provide the first rigorous insights on why greedy pruning + fine-tuning leads to smaller models which also generalize well.
- Score: 30.508036898655114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pruning schemes have been widely used in practice to reduce the complexity of
trained models with a massive number of parameters. In fact, several practical
studies have shown that if a pruned model is fine-tuned with some
gradient-based updates it generalizes well to new samples. Although the above
pipeline, which we refer to as pruning + fine-tuning, has been extremely
successful in lowering the complexity of trained models, there is very little
known about the theory behind this success. In this paper, we address this
issue by investigating the pruning + fine-tuning framework on the
overparameterized matrix sensing problem with the ground truth $U_\star \in
\mathbb{R}^{d \times r}$ and the overparameterized model $U \in \mathbb{R}^{d
\times k}$ with $k \gg r$. We study the approximate local minima of the mean
square error, augmented with a smooth version of a group Lasso regularizer,
$\sum_{i=1}^k \| U e_i \|_2$. In particular, we provably show that pruning all
the columns below a certain explicit $\ell_2$-norm threshold results in a
solution $U_{\text{prune}}$ which has the minimum number of columns $r$, yet is
close to the ground truth in training loss. Moreover, in the subsequent
fine-tuning phase, gradient descent initialized at $U_{\text{prune}}$ converges
at a linear rate to its limit. While our analysis provides insights into the
role of regularization in pruning, we also show that running gradient descent
in the absence of regularization results in models which are not suitable for
greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable
to that of the maximum. To the best of our knowledge, our results provide the
first rigorous insights on why greedy pruning + fine-tuning leads to smaller
models which also generalize well.
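
To make the pipeline concrete, the following NumPy sketch runs the three stages on a synthetic instance: gradient descent on the mean square error augmented with the smoothed group-Lasso penalty, greedy pruning of every column whose $\ell_2$ norm falls below a threshold, and unregularized fine-tuning initialized at $U_{\text{prune}}$. This is a minimal illustration rather than the authors' implementation; the problem sizes, the smoothing parameter `eps`, the weight `lam`, the threshold `tau`, and the step sizes are placeholder choices, not the explicit constants from the paper.

```python
# Minimal sketch (not the authors' code) of pruning + fine-tuning for matrix sensing.
# All hyperparameters below (eps, lam, tau, step sizes, dimensions) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, r, k, m = 20, 2, 10, 400                        # overparameterized width k >> true rank r
U_star = rng.normal(size=(d, r)) / np.sqrt(d)      # ground truth U_* in R^{d x r}
A = rng.normal(size=(m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2                 # symmetric Gaussian sensing matrices
y = np.einsum('mij,ij->m', A, U_star @ U_star.T)   # measurements y_i = <A_i, U_* U_*^T>

def mse(U):
    """Mean squared sensing error (1/m) * sum_i (<A_i, U U^T> - y_i)^2."""
    return np.mean((np.einsum('mij,ij->m', A, U @ U.T) - y) ** 2)

def mse_grad(U):
    """Gradient of mse(U): (4/m) * sum_i (<A_i, U U^T> - y_i) A_i U, for symmetric A_i."""
    res = np.einsum('mij,ij->m', A, U @ U.T) - y
    return (4.0 / m) * np.einsum('m,mij,jl->il', res, A, U)

def reg_grad(U, lam=0.02, eps=1e-3):
    """Gradient of the smoothed group Lasso: lam * sum_i sqrt(||U e_i||_2^2 + eps)."""
    col_norms = np.sqrt(np.sum(U ** 2, axis=0) + eps)
    return lam * U / col_norms

# Phase 1: gradient descent on the regularized objective from a small random start.
U = 0.1 * rng.normal(size=(d, k)) / np.sqrt(d)
for _ in range(2000):
    U -= 0.02 * (mse_grad(U) + reg_grad(U))

# Greedy pruning: keep only the columns whose l2 norm exceeds the threshold tau.
tau = 0.2                                          # placeholder; the paper gives an explicit threshold
U_prune = U[:, np.linalg.norm(U, axis=0) > tau]
print("columns kept:", U_prune.shape[1], "(ideally r =", r, ")")

# Phase 2: fine-tuning with plain (unregularized) gradient descent from U_prune.
for _ in range(2000):
    U_prune -= 0.02 * mse_grad(U_prune)
print("training loss after fine-tuning:", mse(U_prune))
```

In a successful run the pruning step keeps roughly $r$ columns, and the final loop corresponds to the fine-tuning phase in which the paper shows gradient descent initialized at $U_{\text{prune}}$ converges at a linear rate.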
Related papers
- Pruning is Optimal for Learning Sparse Features in High-Dimensions [15.967123173054535]
We show that a class of statistical models can be optimally learned using pruned neural networks trained with gradient descent.
We show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks.
arXiv Detail & Related papers (2024-06-12T21:43:12Z)
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed by Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical restrictions of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve the $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$ bound of regret, while G-ESTS can achieve the $\tilde{O}(\ldots)$
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
- A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z)
- Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression [16.551664358490658]
In deep learning, often the training process finds an interpolator (a solution with 0 training loss) but the test loss is still low.
One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator.
We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss.
arXiv Detail & Related papers (2023-02-01T05:41:41Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, expressed as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal [70.15267479220691]
We consider and analyze the sample complexity of model-free reinforcement learning with a generative model.
Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small.
arXiv Detail & Related papers (2022-05-27T19:39:24Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Learning Sparse Graph with Minimax Concave Penalty under Gaussian Markov Random Fields [51.07460861448716]
This paper presents a convex-analytic framework to learn sparse graphs from data.
We show that a triangular convexity decomposition is guaranteed by a transform of the matrix corresponding to its upper part.
arXiv Detail & Related papers (2021-09-17T17:46:12Z)
- Estimating Stochastic Linear Combination of Non-linear Regressions Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large, the estimation errors will not be sacrificed by too much.
To the best of our knowledge, this is the first work that provides guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z)
- Does the $\ell_1$-norm Learn a Sparse Graph under Laplacian Constrained Graphical Models? [13.572602792770288]
We consider the problem of learning a graph under Laplacian constrained Gaussian graphical models.
We show that a large regularization parameter will surprisingly lead to a complete graph, i.e., every pair of nodes is connected by an edge.
arXiv Detail & Related papers (2020-06-26T12:06:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.