Demystify Optimization and Generalization of Over-parameterized
PAC-Bayesian Learning
- URL: http://arxiv.org/abs/2202.01958v1
- Date: Fri, 4 Feb 2022 03:49:11 GMT
- Title: Demystify Optimization and Generalization of Over-parameterized
PAC-Bayesian Learning
- Authors: Wei Huang, Chunrui Liu, Yilan Chen, Tianyu Liu, and Richard Yi Da Xu
- Abstract summary: PAC-Bayes is an analysis framework in which the training error can be expressed as a weighted average over the hypotheses in the posterior distribution.
We show that when PAC-Bayes learning is applied, the convergence result corresponds to solving a kernel ridge regression.
We further characterize a uniform PAC-Bayesian generalization bound that improves over the Rademacher complexity-based bound for non-probabilistic neural networks.
- Score: 20.295960197612743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: PAC-Bayes is an analysis framework in which the training error can be
expressed as a weighted average over the hypotheses in the posterior distribution
while incorporating prior knowledge. Beyond being a pure generalization bound
analysis tool, a PAC-Bayesian bound can also be incorporated into an objective
function to train a probabilistic neural network, making it a powerful and
relevant framework that can numerically provide a tight generalization bound for
supervised learning. For simplicity, we refer to training a probabilistic neural
network with objectives derived from PAC-Bayesian bounds as PAC-Bayesian
learning. Despite its empirical success, the theoretical analysis of
PAC-Bayesian learning for neural networks is rarely explored. This paper
proposes a new class of convergence and generalization analyses for PAC-Bayes
learning when it is used to train over-parameterized neural networks by gradient
descent. For a wide probabilistic neural network, we show that when PAC-Bayes
learning is applied, the convergence result corresponds to solving a kernel
ridge regression with the probabilistic neural tangent kernel (PNTK) as its
kernel. Based on this finding, we further characterize a uniform PAC-Bayesian
generalization bound that improves over the Rademacher complexity-based bound
for non-probabilistic neural networks. Finally, drawing on these theoretical
results, we propose a proxy measure for efficient hyperparameter selection,
which is proven to be time-saving.
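To make the stated correspondence concrete, the following minimal sketch solves a kernel ridge regression using the empirical finite-width NTK of a small deterministic network as a stand-in for the PNTK. The architecture, data, and ridge strength `lam` are illustrative assumptions, not the paper's construction; `lam` loosely plays the role of the regularization that the KL term of a PAC-Bayes objective would induce.

```python
import torch

torch.manual_seed(0)

# Wide two-layer network standing in for the (probabilistic) network;
# its empirical NTK at initialization is used as a proxy for the PNTK.
width = 2048
net = torch.nn.Sequential(
    torch.nn.Linear(5, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
)

def param_grad(x):
    """Flattened gradient of the scalar output with respect to all parameters."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def empirical_ntk(xs1, xs2):
    """Theta[i, j] = <d f(x_i)/d theta, d f(x_j)/d theta>."""
    J1 = torch.stack([param_grad(x) for x in xs1])
    J2 = torch.stack([param_grad(x) for x in xs2])
    return J1 @ J2.T

# Random toy data, for illustration only.
X_train, y_train = torch.randn(20, 5), torch.randn(20, 1)
X_test = torch.randn(3, 5)

# Kernel ridge regression: f(x*) = k(x*, X) (K + lam I)^{-1} y.
lam = 1e-2  # hypothetical ridge strength
K = empirical_ntk(X_train, X_train)
k_star = empirical_ntk(X_test, X_train)
alpha = torch.linalg.solve(K + lam * torch.eye(len(X_train)), y_train)
print((k_star @ alpha).squeeze())
```

In the paper's setting the PNTK would take the place of the deterministic empirical NTK above, but the predictor has the same closed form.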
Related papers
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior:
From Theory to Practice [54.03076395748459]
A central question in the meta-learning literature is how to regularize to ensure generalization to unseen tasks.
We present a generalization bound for meta-learning, which was first derived by Rothfuss et al.
We provide a theoretical analysis and an empirical case study of the conditions under which, and the extent to which, these meta-learning guarantees improve upon PAC-Bayesian per-task learning bounds.
arXiv Detail & Related papers (2022-11-14T08:51:04Z) - PAC-Bayesian Learning of Aggregated Binary Activated Neural Networks
with Probabilities over Representations [2.047424180164312]
We study the expectation of a probabilistic neural network as a predictor by itself, focusing on the aggregation of binary activated neural networks with normal distributions over real-valued weights.
We show that the exact computation remains tractable for deep but narrow neural networks, thanks to a dynamic programming approach.
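For intuition about why such expectations are computable in closed form, the sketch below evaluates the exact expected output of a one-hidden-layer sign-activated network with an isotropic Gaussian posterior over the hidden weights and deterministic output weights, and checks it against Monte Carlo sampling. This one-layer setting and its parameterization are simplifying assumptions; the cited paper's dynamic-programming construction for deep networks is not reproduced here.

```python
import math
import torch

torch.manual_seed(0)

d, m = 8, 16                      # input dimension, hidden units
x = torch.randn(d)
mu = torch.randn(m, d)            # posterior means of the hidden weights
sigma = 0.5                       # shared posterior standard deviation
v = torch.randn(m) / m            # deterministic output weights

# Since w_i . x ~ N(mu_i . x, sigma^2 ||x||^2), the expected sign activation is
# E[sgn(w_i . x)] = erf(mu_i . x / (sigma ||x|| sqrt(2))).
expected_hidden = torch.erf(mu @ x / (sigma * x.norm() * math.sqrt(2.0)))
expected_output = v @ expected_hidden

# Monte Carlo check by sampling the Gaussian hidden weights.
W = mu + sigma * torch.randn(20000, m, d)       # 20000 weight samples
mc_output = (torch.sign(W @ x) @ v).mean()
print(expected_output.item(), mc_output.item())
```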
arXiv Detail & Related papers (2021-10-28T14:11:07Z) - Wide stochastic networks: Gaussian limit and PAC-Bayesian training [21.979820411421827]
We show that an extremely large network is approximated by a Gaussian process, both before and during training.
The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimizes the bound.
For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.
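As a rough illustration of optimizing a PAC-Bayes bound directly (rather than using it only as a post-hoc certificate), here is a minimal sketch assuming a McAllester-style bound, a diagonal Gaussian posterior with means `mu` and log standard deviations `log_sigma`, a fixed zero-mean Gaussian prior, and a toy probabilistic linear model. The cited paper's Gaussian-limit procedure is not reproduced, and cross-entropy is used as a convenient surrogate for the bounded loss the bound formally requires.

```python
import math
import torch

torch.manual_seed(0)

def gaussian_kl(mu_q, log_sigma_q, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(0, sigma_p^2)), summed over all weights."""
    var_q = torch.exp(2 * log_sigma_q)
    return 0.5 * torch.sum(
        (var_q + mu_q**2) / sigma_p**2 - 1 - 2 * log_sigma_q + 2 * math.log(sigma_p)
    )

def pac_bayes_objective(emp_risk, kl, n, delta=0.05):
    """McAllester-style surrogate: risk + sqrt((KL + ln(2 sqrt(n)/delta)) / (2n))."""
    return emp_risk + torch.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Toy probabilistic linear classifier with a diagonal Gaussian posterior.
d, n = 10, 200
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()
mu = torch.zeros(d, requires_grad=True)
log_sigma = torch.full((d,), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(500):
    eps = torch.randn(d)                     # reparameterization trick
    w = mu + torch.exp(log_sigma) * eps      # one posterior sample
    emp_risk = torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)
    loss = pac_bayes_objective(emp_risk, gaussian_kl(mu, log_sigma, 1.0), n)
    opt.zero_grad()
    loss.backward()
    opt.step()
```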
arXiv Detail & Related papers (2021-06-17T20:25:38Z) - Training Sparse Neural Network by Constraining Synaptic Weight on Unit
Lp Sphere [2.429910016019183]
Constraining the synaptic weights to the unit Lp-sphere enables flexible control of the sparsity through the choice of p.
Our approach is validated by experiments on benchmark datasets covering a wide range of domains.
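A minimal sketch of the basic mechanism, renormalizing weights onto the unit Lp-sphere after each update so that p controls sparsity; the per-tensor granularity, the rescaling-based projection, and the toy training loop are assumptions for illustration rather than details from the cited paper.

```python
import torch

def renormalize_to_unit_lp_sphere(w, p):
    """Rescale w so that its Lp norm equals 1."""
    lp_norm = w.abs().pow(p).sum().pow(1.0 / p)
    return w / lp_norm.clamp_min(1e-12)

# Toy regression model; after each optimizer step the weight matrix is
# pulled back onto the unit Lp-sphere.
model = torch.nn.Linear(20, 5)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randn(32, 5)
p = 1.0  # smaller p encourages sparser weights

for _ in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        model.weight.copy_(renormalize_to_unit_lp_sphere(model.weight, p))
```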
arXiv Detail & Related papers (2021-03-30T01:02:31Z) - Analytically Tractable Inference in Deep Neural Networks [0.0]
The Tractable Approximate Gaussian Inference (TAGI) algorithm was shown to be a viable and scalable alternative to backpropagation for shallow fully-connected neural networks.
We demonstrate how TAGI matches or exceeds the performance of backpropagation for training classic deep neural network architectures.
arXiv Detail & Related papers (2021-03-09T14:51:34Z) - A General Framework for the Practical Disintegration of PAC-Bayesian
Bounds [2.516393111664279]
We introduce new PAC-Bayesian generalization bounds whose originality is to be disintegrated, i.e., they hold for a single hypothesis drawn from the posterior rather than in expectation over it.
Our bounds are easily optimizable and can be used to design learning algorithms.
arXiv Detail & Related papers (2021-02-17T09:36:46Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in their depth, in a number of iterations logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Generalization bound of globally optimal non-convex neural network
training: Transportation map estimation by infinite dimensional Langevin
dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks for neural network optimization analysis, such as mean-field theory and neural tangent kernel theory, typically require taking the infinite-width limit of the network to show global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z) - PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees [77.67258935234403]
We provide a theoretical analysis using the PAC-Bayesian framework and derive novel generalization bounds for meta-learning.
We develop a class of PAC-optimal meta-learning algorithms with performance guarantees and a principled meta-level regularization.
arXiv Detail & Related papers (2020-02-13T15:01:38Z) - Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective.
We propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)