Distance-Based Regularisation of Deep Networks for Fine-Tuning
- URL: http://arxiv.org/abs/2002.08253v3
- Date: Fri, 15 Jan 2021 16:05:16 GMT
- Title: Distance-Based Regularisation of Deep Networks for Fine-Tuning
- Authors: Henry Gouk, Timothy M. Hospedales, Massimiliano Pontil
- Abstract summary: We develop an algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights.
Empirical evaluation shows that our algorithm works well, corroborating our theoretical results.
- Score: 116.71288796019809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate approaches to regularisation during fine-tuning of deep neural
networks. First we provide a neural network generalisation bound based on
Rademacher complexity that uses the distance the weights have moved from their
initial values. This bound has no direct dependence on the number of weights
and compares favourably to other bounds when applied to convolutional networks.
Our bound is highly relevant for fine-tuning, because providing a network with
a good initialisation based on transfer learning means that learning can modify
the weights less, and hence achieve tighter generalisation. Inspired by this,
we develop a simple yet effective fine-tuning algorithm that constrains the
hypothesis class to a small sphere centred on the initial pre-trained weights,
thus obtaining provably better generalisation performance than conventional
transfer learning. Empirical evaluation shows that our algorithm works well,
corroborating our theoretical results. It outperforms both state-of-the-art
fine-tuning competitors and penalty-based alternatives, which we show do not
directly constrain the radius of the search space.
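The constraint described above amounts to projected fine-tuning: after each optimiser step, the deviation of the weights from their pre-trained values is projected back onto a ball of fixed radius. Below is a minimal PyTorch-style sketch of this idea, assuming a single global L2 ball for simplicity; the names `project_to_ball`, `fine_tune`, and the `radius` hyperparameter are illustrative, and the paper's own method may differ in the choice of norm and in how the constraint is applied across layers.

```python
import torch

def project_to_ball(model, init_params, radius):
    # Project the deviation of all weights from their pre-trained values
    # back onto a single L2 ball of the given radius.
    with torch.no_grad():
        deltas = [p - p0 for p, p0 in zip(model.parameters(), init_params)]
        dist = torch.sqrt(sum((d ** 2).sum() for d in deltas))
        if dist > radius:
            scale = radius / dist
            for p, p0, d in zip(model.parameters(), init_params, deltas):
                p.copy_(p0 + scale * d)

def fine_tune(model, loader, loss_fn, radius=1.0, lr=1e-3, epochs=3):
    # Snapshot the pre-trained weights, then alternate gradient steps
    # with projection so the search stays inside the ball.
    init_params = [p.detach().clone() for p in model.parameters()]
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in loader:
            optimiser.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimiser.step()
            project_to_ball(model, init_params, radius)
    return model
```

By contrast, the penalty-based alternatives mentioned in the abstract add a term such as lambda * ||w - w0||^2 to the loss, which biases the solution towards the initialisation but, as the paper argues, does not directly bound the radius of the search space.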
Related papers
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ a structure using residual connections around nonlinear network sections that allow the flow of information through the network once a nonlinear section is pruned.
arXiv Detail & Related papers (2024-06-06T23:19:57Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - On the generalization of learning algorithms that do not converge [54.122745736433856]
Generalization analyses of deep learning typically assume that the training converges to a fixed point.
Recent results indicate that in practice, the weights of deep neural networks optimized with gradient descent often oscillate indefinitely.
arXiv Detail & Related papers (2022-08-16T21:22:34Z) - Robust Learning of Parsimonious Deep Neural Networks [0.0]
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network.
We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection.
We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures.
arXiv Detail & Related papers (2022-05-10T03:38:55Z) - Analytically Tractable Inference in Deep Neural Networks [0.0]
The Tractable Approximate Gaussian Inference (TAGI) algorithm was shown to be a viable and scalable alternative to backpropagation for shallow fully-connected neural networks.
We demonstrate how TAGI matches or exceeds the performance of backpropagation for training classic deep neural network architectures.
arXiv Detail & Related papers (2021-03-09T14:51:34Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, in time that is logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - Fiedler Regularization: Learning Neural Networks with Graph Sparsity [6.09170287691728]
We introduce a novel regularization approach for deep learning that incorporates and respects the underlying graphical structure of the neural network.
We propose to use the Fiedler value of the neural network's underlying graph, i.e. the second-smallest eigenvalue of its graph Laplacian, as a tool for regularization; a minimal sketch of this quantity appears after this list.
arXiv Detail & Related papers (2020-03-02T16:19:33Z)
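As an illustration of the last entry above (not the authors' implementation), the sketch below builds the weighted graph of a small fully-connected network from the absolute values of its weights, forms the graph Laplacian, and reads off the Fiedler value as its second-smallest eigenvalue. The layer sizes and random weights are hypothetical.

```python
import numpy as np

def fiedler_value(weight_matrices):
    # Build the adjacency matrix of the layered graph whose edge weights
    # are the absolute values of the network weights.
    sizes = [weight_matrices[0].shape[1]] + [w.shape[0] for w in weight_matrices]
    n = sum(sizes)
    offsets = np.cumsum([0] + sizes)
    adj = np.zeros((n, n))
    for k, w in enumerate(weight_matrices):  # w has shape (out, in)
        rows = slice(offsets[k + 1], offsets[k + 2])
        cols = slice(offsets[k], offsets[k + 1])
        adj[rows, cols] = np.abs(w)
        adj[cols, rows] = np.abs(w).T
    # Graph Laplacian L = D - A; its second-smallest eigenvalue is the
    # Fiedler value (algebraic connectivity).
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigvals = np.linalg.eigvalsh(laplacian)
    return eigvals[1]

# Toy usage: a 4-8-2 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
print("Fiedler value:", fiedler_value(weights))
```

In a regularised training loop this quantity (or a differentiable surrogate of it) would be scaled by a penalty coefficient and added to the loss; that coefficient is a hypothetical hyperparameter here.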
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.