Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded
learning
- URL: http://arxiv.org/abs/2110.11804v1
- Date: Fri, 22 Oct 2021 14:25:22 GMT
- Title: Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded
learning
- Authors: Soufiane Hayou, Bobby He, Gintare Karolina Dziugaite
- Abstract summary: We study an approach to learning pruning masks by optimizing the expected loss of stochastic pruning masks.
We analyze the training dynamics of the induced stochastic predictor in the setting of linear regression.
We show that a PAC-Bayes generalization error bound is controlled by the magnitude of the change in feature alignment between the 'prior' and 'posterior' data.
- Score: 16.526326919313924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study an approach to learning pruning masks by optimizing the expected
loss of stochastic pruning masks, i.e., masks which zero out each weight
independently with some weight-specific probability. We analyze the training
dynamics of the induced stochastic predictor in the setting of linear
regression, and observe a data-adaptive L1 regularization term, in contrast to
the data-adaptive L2 regularization term known to underlie dropout in linear
regression. We also observe a preference to prune weights that are less
well-aligned with the data labels. We evaluate probabilistic fine-tuning for
optimizing stochastic pruning masks for neural networks, starting from masks
produced by several baselines. In each case, we see improvements in test error
over baselines, even after we threshold fine-tuned stochastic pruning masks.
Finally, since a stochastic pruning mask induces a stochastic neural network,
we consider training the weights and/or pruning probabilities simultaneously to
minimize a PAC-Bayes bound on generalization error. Using data-dependent
priors, we obtain a self-bounded learning algorithm with strong performance and
numerically tight bounds. In the linear model, we show that a PAC-Bayes
generalization error bound is controlled by the magnitude of the change in
feature alignment between the 'prior' and 'posterior' data.
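For intuition, the stochastic-mask objective parallels the classical dropout identity in linear regression, $\mathbb{E}_m[(y - (m \odot w)^\top x)^2] = (y - (p \odot w)^\top x)^2 + \sum_i p_i(1-p_i)\, w_i^2 x_i^2$, whose second term is the data-adaptive L2 penalty; the abstract's point is that optimizing the mask probabilities themselves instead surfaces a data-adaptive L1 term. The sketch below is a hypothetical illustration of the overall recipe rather than the authors' implementation: per-weight keep probabilities are trained (here via a straight-through Bernoulli estimator) to minimize the expected loss plus a Bernoulli KL term of the kind that enters a PAC-Bayes bound, and the fine-tuned probabilities are finally thresholded to a deterministic mask. PyTorch, the class name, and all hyperparameters are assumptions.

```python
# Hypothetical sketch (not the authors' released code): probabilistic fine-tuning of a
# stochastic pruning mask. Per-weight keep probabilities are optimized to minimize the
# expected loss under Bernoulli masking, plus a Bernoulli KL term standing in for the
# complexity part of a PAC-Bayes bound; the probabilities are thresholded at the end.
import math
import torch
import torch.nn.functional as F


class StochasticallyPrunedLinear(torch.nn.Module):
    """Linear layer whose weights are kept independently with learnable probabilities."""

    def __init__(self, in_features, out_features, init_keep_prob=0.9):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        init_logit = math.log(init_keep_prob / (1.0 - init_keep_prob))
        # Logits of the per-weight keep probabilities (the "posterior" over pruning masks).
        self.keep_logits = torch.nn.Parameter(
            torch.full((out_features, in_features), init_logit))

    def forward(self, x):
        p = torch.sigmoid(self.keep_logits)
        mask = torch.bernoulli(p)        # sample a pruning mask
        mask = mask + p - p.detach()     # straight-through: gradients flow into p
        return F.linear(x, self.weight * mask)

    def kl_to_prior(self, prior_keep_prob=0.9):
        # KL between the per-weight Bernoulli posterior and a fixed Bernoulli prior;
        # a data-dependent prior would instead be fit on held-out "prior" data.
        p = torch.sigmoid(self.keep_logits)
        q = prior_keep_prob
        return (p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))).sum()

    def thresholded_mask(self, threshold=0.5):
        # Deterministic pruning mask obtained from the fine-tuned probabilities.
        return (torch.sigmoid(self.keep_logits) > threshold).float()


layer = StochasticallyPrunedLinear(64, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(128, 64), torch.randn(128, 1)
for _ in range(200):
    opt.zero_grad()
    # One-sample Monte Carlo estimate of the expected loss, plus a weighted KL term.
    loss = F.mse_loss(layer(x), y) + 1e-4 * layer.kl_to_prior()
    loss.backward()
    opt.step()
final_mask = layer.thresholded_mask()  # prune by thresholding the learned probabilities
```

Training both `self.weight` and `self.keep_logits` in the same loop mirrors the abstract's setting of optimizing weights and/or pruning probabilities simultaneously; freezing `self.weight` recovers pure mask fine-tuning from a fixed network.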
Related papers
- Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach [18.009376840944284]
We present an algorithm that can effectively regularize the Hessian of the loss, leading to regions of the loss surface with bounded curvature (flat minima).
Our approach is effective for improving generalization on CLIP pretraining and chain-of-thought fine-tuning datasets.
arXiv Detail & Related papers (2023-06-14T14:58:36Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - Improved uncertainty quantification for neural networks with Bayesian
last layer [0.0]
Uncertainty quantification is an important task in machine learning.
We present a reformulation of the log-marginal likelihood of a NN with BLL which allows for efficient training using backpropagation.
arXiv Detail & Related papers (2023-02-21T20:23:56Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive
Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z) - GFlowOut: Dropout with Generative Flow Networks [76.59535235717631]
Monte Carlo Dropout has been widely used as a relatively cheap way to perform approximate inference.
Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference.
GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks (a minimal Monte Carlo dropout sketch appears after this list).
arXiv Detail & Related papers (2022-10-24T03:00:01Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs)
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Scalable Marginal Likelihood Estimation for Model Selection in Deep
Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z) - Variational Laplace for Bayesian neural networks [25.055754094939527]
Variational Laplace exploits a local approximation of the likelihood to estimate the ELBO without the need for sampling the neural-network weights.
We show that early-stopping can be avoided by increasing the learning rate for the variance parameters.
arXiv Detail & Related papers (2021-02-27T14:06:29Z) - Variational Laplace for Bayesian neural networks [33.46810568687292]
We develop variational Laplace for Bayesian neural networks (BNNs)
We exploit a local approximation of the curvature of the likelihood to estimate the ELBO without the need for sampling the neural-network weights.
We show that early-stopping can be avoided by increasing the learning rate for the variance parameters.
arXiv Detail & Related papers (2020-11-20T15:16:18Z) - Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a new regression analysis for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)
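As a companion to the GFlowOut entry above, here is a minimal Monte Carlo dropout sketch: dropout stays active at test time and several stochastic forward passes are averaged, treating the sampled masks as draws from an approximate posterior over masks. This is the cheap baseline whose mask posterior GFlowOut is meant to improve on; the model, dropout rate, and sample count below are placeholder assumptions, not taken from that paper.

```python
# Minimal Monte Carlo dropout sketch (illustrative; not from the GFlowOut paper).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(20, 64),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.2),  # the stochastic mask; GFlowOut learns a posterior over it
    torch.nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        draws = torch.stack([model(x) for _ in range(n_samples)])
    # Predictive mean and a simple uncertainty proxy from the spread of the draws.
    return draws.mean(dim=0), draws.std(dim=0)

mean, std = mc_dropout_predict(model, torch.randn(8, 20))
```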