Towards Theoretically Inspired Neural Initialization Optimization
- URL: http://arxiv.org/abs/2210.05956v1
- Date: Wed, 12 Oct 2022 06:49:16 GMT
- Title: Towards Theoretically Inspired Neural Initialization Optimization
- Authors: Yibo Yang, Hong Wang, Haobo Yuan, Zhouchen Lin
- Abstract summary: We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint.
Generalized from the sample-wise analysis to the real batch setting, the resulting Neural Initialization Optimization (NIO) algorithm automatically finds a better initialization at negligible cost.
- Score: 66.04735385415427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated machine learning has been widely explored to reduce the human
effort of designing neural architectures and searching for proper hyperparameters. In
the domain of neural initialization, however, similar automated techniques have
rarely been studied. Most existing initialization methods are handcrafted and
highly dependent on specific architectures. In this paper, we propose a
differentiable quantity, named GradCosine, with theoretical insights to
evaluate the initial state of a neural network. Specifically, GradCosine is the
cosine similarity of sample-wise gradients with respect to the initialized
parameters. By analyzing the sample-wise optimization landscape, we show that
both the training and test performance of a network can be improved by
maximizing GradCosine under gradient norm constraint. Based on this
observation, we further propose the neural initialization optimization (NIO)
algorithm. Generalized from the sample-wise analysis into the real batch
setting, NIO is able to automatically look for a better initialization with
negligible cost compared with the training time. With NIO, we improve the
classification performance of a variety of neural architectures on CIFAR-10,
CIFAR-100, and ImageNet. Moreover, we find that our method can even help to
train large vision Transformer architectures without warmup.
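To make the quantity concrete, below is a minimal sketch of how GradCosine could be computed in PyTorch. The model, the use of cross-entropy loss, and the function name `gradcosine` are illustrative assumptions, not the paper's reference implementation; the NIO algorithm additionally maximizes this quantity under a gradient norm constraint and generalizes the analysis to the batch setting.

```python
import torch
import torch.nn.functional as F

def gradcosine(model, inputs, targets):
    """Average pairwise cosine similarity of per-sample gradients
    w.r.t. the initialized parameters (illustrative sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in zip(inputs, targets):
        # per-sample loss and gradient at initialization
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.flatten() for gi in g]))
    G = torch.stack(grads)          # (n_samples, n_params) gradient matrix
    Gn = F.normalize(G, dim=1)      # unit-norm per-sample gradients
    sim = Gn @ Gn.t()               # all pairwise cosine similarities
    n = sim.size(0)
    # mean over distinct pairs, excluding the diagonal of ones
    return (sim.sum() - n) / (n * (n - 1))
```

A higher value means the per-sample gradients point in similar directions at initialization, which the paper links to better training and test performance.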
Related papers
- Advancing Neural Network Performance through Emergence-Promoting Initialization Scheme [0.0]
We introduce a novel yet straightforward neural network initialization scheme.
Inspired by the concept of emergence and leveraging the emergence measures proposed by Li (2023), our method adjusts layer-wise weight scaling factors to achieve higher emergence values.
We demonstrate substantial improvements in both model accuracy and training speed, with and without batch normalization.
arXiv Detail & Related papers (2024-07-26T18:56:47Z) - Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally [2.645067871482715]
In machine learning tasks, one searches for an optimal function within a certain functional space.
Fixing the architecture in advance forces the function learned during training to stay within what that architecture can express.
We show that information about desirable architectural changes due to expressivity bottlenecks can be extracted from backpropagation.
arXiv Detail & Related papers (2024-05-30T08:23:56Z) - Principled Architecture-aware Scaling of Hyperparameters [69.98414153320894]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings in benchmarks can easily change when the networks are trained with better, architecture-aware hyperparameters.
arXiv Detail & Related papers (2024-02-27T11:52:49Z) - Learning Large-scale Neural Fields via Context Pruned Meta-Learning [60.93679437452872]
We introduce an efficient optimization-based meta-learning technique for large-scale neural field training.
We show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields.
Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals.
arXiv Detail & Related papers (2023-02-01T17:32:16Z) - Differentiable Neural Architecture Learning for Efficient Neural Network
Design [31.23038136038325]
We introduce a novel architecture parameterisation based on the scaled sigmoid function.
We then propose a general Differentiable Neural Architecture Learning (DNAL) method to optimize the neural architecture without the need to evaluate candidate neural networks.
arXiv Detail & Related papers (2021-03-03T02:03:08Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value. A minimal sketch of this one-step heuristic appears after this list.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - A Semi-Supervised Assessor of Neural Architectures [157.76189339451565]
We employ an auto-encoder to discover meaningful representations of neural architectures.
A graph convolutional neural network is introduced to predict the performance of architectures.
arXiv Detail & Related papers (2020-05-14T09:02:33Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based learning combined with nonconvexity renders learning susceptible to initialization problems.
We propose fusing neighboring layers of deeper networks that are initialized with random weights.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
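As a companion to the GradInit entry above, here is a minimal sketch of its one-step heuristic, assuming a PyTorch classifier; the helper name `one_step_loss` and the plain cross-entropy loss are illustrative. GradInit itself optimizes learnable per-layer scale factors by gradient descent under a gradient-norm constraint, whereas this sketch only evaluates the objective for one candidate set of scales.

```python
import copy
import torch
import torch.nn.functional as F

def one_step_loss(model, scales, x, y, lr=0.1):
    """Loss after one SGD step when each parameter tensor is
    multiplied by its scale factor (hypothetical helper)."""
    trial = copy.deepcopy(model)                  # keep the original model intact
    params = [p for p in trial.parameters() if p.requires_grad]
    with torch.no_grad():
        for p, s in zip(params, scales):          # rescale the initialization
            p.mul_(s)
    loss = F.cross_entropy(trial(x), y)
    grads = torch.autograd.grad(loss, params)     # gradients at the rescaled init
    with torch.no_grad():
        for p, g in zip(params, grads):           # one plain SGD step
            p.sub_(lr * g)
    return F.cross_entropy(trial(x), y)           # the quantity GradInit minimizes
```

Candidate scale vectors yielding a lower one-step loss correspond to initializations from which a single optimizer step makes the most progress.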
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.