Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded
learning
- URL: http://arxiv.org/abs/2110.11804v1
- Date: Fri, 22 Oct 2021 14:25:22 GMT
- Title: Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded
learning
- Authors: Soufiane Hayou, Bobby He, Gintare Karolina Dziugaite
- Abstract summary: We study an approach to learning pruning masks by optimizing the expected loss of stochastic pruning masks.
We analyze the training dynamics of the induced stochastic predictor in the setting of linear regression.
We show that a PAC-Bayes generalization error bound is controlled by the magnitude of the change in feature alignment between the 'prior' and 'posterior' data.
- Score: 16.526326919313924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study an approach to learning pruning masks by optimizing the expected
loss of stochastic pruning masks, i.e., masks which zero out each weight
independently with some weight-specific probability. We analyze the training
dynamics of the induced stochastic predictor in the setting of linear
regression, and observe a data-adaptive L1 regularization term, in contrast to
the data-adaptive L2 regularization term known to underlie dropout in linear
regression. We also observe a preference to prune weights that are less
well-aligned with the data labels. We evaluate probabilistic fine-tuning for
optimizing stochastic pruning masks for neural networks, starting from masks
produced by several baselines. In each case, we see improvements in test error
over baselines, even after we threshold fine-tuned stochastic pruning masks.
Finally, since a stochastic pruning mask induces a stochastic neural network,
we consider training the weights and/or pruning probabilities simultaneously to
minimize a PAC-Bayes bound on generalization error. Using data-dependent
priors, we obtain a self-bounded learning algorithm with strong performance and
numerically tight bounds. In the linear model, we show that a PAC-Bayes
generalization error bound is controlled by the magnitude of the change in
feature alignment between the 'prior' and 'posterior' data.
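For intuition, the stochastic-mask objective parallels the classical dropout identity in linear regression, $\mathbb{E}_m[(y - (m \odot w)^\top x)^2] = (y - (p \odot w)^\top x)^2 + \sum_i p_i(1-p_i)\, w_i^2 x_i^2$, whose second term is the data-adaptive L2 penalty; the abstract's point is that optimizing the mask probabilities themselves instead surfaces a data-adaptive L1 term. The sketch below is a hypothetical illustration of the overall recipe rather than the authors' implementation: per-weight keep probabilities are trained (here via a straight-through Bernoulli estimator) to minimize the expected loss plus a Bernoulli KL term of the kind that enters a PAC-Bayes bound, and the fine-tuned probabilities are finally thresholded to a deterministic mask. PyTorch, the class name, and all hyperparameters are assumptions.

```python
# Hypothetical sketch (not the authors' released code): probabilistic fine-tuning of a
# stochastic pruning mask. Per-weight keep probabilities are optimized to minimize the
# expected loss under Bernoulli masking, plus a Bernoulli KL term standing in for the
# complexity part of a PAC-Bayes bound; the probabilities are thresholded at the end.
import math
import torch
import torch.nn.functional as F


class StochasticallyPrunedLinear(torch.nn.Module):
    """Linear layer whose weights are kept independently with learnable probabilities."""

    def __init__(self, in_features, out_features, init_keep_prob=0.9):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        init_logit = math.log(init_keep_prob / (1.0 - init_keep_prob))
        # Logits of the per-weight keep probabilities (the "posterior" over pruning masks).
        self.keep_logits = torch.nn.Parameter(
            torch.full((out_features, in_features), init_logit))

    def forward(self, x):
        p = torch.sigmoid(self.keep_logits)
        mask = torch.bernoulli(p)        # sample a pruning mask
        mask = mask + p - p.detach()     # straight-through: gradients flow into p
        return F.linear(x, self.weight * mask)

    def kl_to_prior(self, prior_keep_prob=0.9):
        # KL between the per-weight Bernoulli posterior and a fixed Bernoulli prior;
        # a data-dependent prior would instead be fit on held-out "prior" data.
        p = torch.sigmoid(self.keep_logits)
        q = prior_keep_prob
        return (p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))).sum()

    def thresholded_mask(self, threshold=0.5):
        # Deterministic pruning mask obtained from the fine-tuned probabilities.
        return (torch.sigmoid(self.keep_logits) > threshold).float()


layer = StochasticallyPrunedLinear(64, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(128, 64), torch.randn(128, 1)
for _ in range(200):
    opt.zero_grad()
    # One-sample Monte Carlo estimate of the expected loss, plus a weighted KL term.
    loss = F.mse_loss(layer(x), y) + 1e-4 * layer.kl_to_prior()
    loss.backward()
    opt.step()
final_mask = layer.thresholded_mask()  # prune by thresholding the learned probabilities
```

Training both `self.weight` and `self.keep_logits` in the same loop mirrors the abstract's setting of optimizing weights and/or pruning probabilities simultaneously; freezing `self.weight` recovers pure mask fine-tuning from a fixed network.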
Related papers
- Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach [18.009376840944284]
We present an algorithm that can effectively regularize the Hessian of the loss, leading to regions of the loss surface with bounded curvature (flat minima).
Our approach is effective for improving generalization on CLIP pretraining and chain-of-thought fine-tuning datasets.
arXiv Detail & Related papers (2023-06-14T14:58:36Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer
Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - Improved uncertainty quantification for neural networks with Bayesian
last layer [0.0]
Uncertainty quantification is an important task in machine learning.
We present a reformulation of the log-marginal likelihood of a NN with BLL which allows for efficient training using backpropagation.
arXiv Detail & Related papers (2023-02-21T20:23:56Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive
Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves both problems by endowing each attention head with a mixed-membership Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z) - GFlowOut: Dropout with Generative Flow Networks [76.59535235717631]
Monte Carlo Dropout has been widely used as a relatively cheap way to perform approximate inference.
Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference.
GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks (a minimal Monte Carlo dropout sketch appears after this list).
arXiv Detail & Related papers (2022-10-24T03:00:01Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs)
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Scalable Marginal Likelihood Estimation for Model Selection in Deep
Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z) - Variational Laplace for Bayesian neural networks [25.055754094939527]
Variational Laplace exploits a local approximation of the likelihood to estimate the ELBO without the need for sampling the neural-network weights.
We show that early-stopping can be avoided by increasing the learning rate for the variance parameters.
arXiv Detail & Related papers (2021-02-27T14:06:29Z) - Variational Laplace for Bayesian neural networks [33.46810568687292]
We develop variational Laplace for Bayesian neural networks (BNNs)
We exploit a local approximation of the curvature of the likelihood to estimate the ELBO without the need for sampling the neural-network weights.
We show that early-stopping can be avoided by increasing the learning rate for the variance parameters.
arXiv Detail & Related papers (2020-11-20T15:16:18Z) - Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a new regression analysis for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)
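As a companion to the GFlowOut entry above, here is a minimal Monte Carlo dropout sketch: dropout stays active at test time and several stochastic forward passes are averaged, treating the sampled masks as draws from an approximate posterior over masks. This is the cheap baseline whose mask posterior GFlowOut is meant to improve on; the model, dropout rate, and sample count below are placeholder assumptions, not taken from that paper.

```python
# Minimal Monte Carlo dropout sketch (illustrative; not from the GFlowOut paper).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(20, 64),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.2),  # the stochastic mask; GFlowOut learns a posterior over it
    torch.nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        draws = torch.stack([model(x) for _ in range(n_samples)])
    # Predictive mean and a simple uncertainty proxy from the spread of the draws.
    return draws.mean(dim=0), draws.std(dim=0)

mean, std = mc_dropout_predict(model, torch.randn(8, 20))
```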