Temperature Balancing, Layer-wise Weight Analysis, and Neural Network
Training
- URL: http://arxiv.org/abs/2312.00359v1
- Date: Fri, 1 Dec 2023 05:38:17 GMT
- Title: Temperature Balancing, Layer-wise Weight Analysis, and Neural Network
Training
- Authors: Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W.
Mahoney, Yaoqing Yang
- Abstract summary: This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regularization in modern machine learning is crucial, and it can take various
forms in algorithmic design: training set, model family, error function,
regularization terms, and optimizations. In particular, the learning rate,
which can be interpreted as a temperature-like parameter within the statistical
mechanics of learning, plays a crucial role in neural network training. Indeed,
many widely adopted training strategies basically just define the decay of the
learning rate over time. This process can be interpreted as decreasing a
temperature, using either a global learning rate (for the entire model) or a
learning rate that varies for each parameter. This paper proposes TempBalance,
a straightforward yet effective layer-wise learning rate method. TempBalance is
based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which
characterizes the implicit self-regularization of different layers in trained
models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide
the scheduling and balancing of temperature across all network layers during
model training, resulting in improved performance during testing. We implement
TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using
ResNets, VGGs, and WideResNets with various depths and widths. Our results show
that TempBalance significantly outperforms ordinary SGD and carefully-tuned
spectral norm regularization. We also show that TempBalance outperforms a
number of state-of-the-art optimizers and learning rate schedulers.
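As a rough, self-contained illustration of the layer-wise temperature-balancing idea in the abstract, the sketch below estimates a heavy-tail exponent alpha for each layer's weight spectrum (a Hill-type estimate over the eigenvalues of W^T W) and scales per-layer learning rates by alpha relative to the network mean. It assumes PyTorch; the helper names `estimate_alpha` and `assign_layer_lrs`, the Hill estimator, and the linear alpha-to-learning-rate mapping are illustrative choices, not the authors' released implementation of TempBalance.

```python
# Minimal sketch of HT-SR-style layer-wise learning-rate balancing.
# `estimate_alpha` and `assign_layer_lrs` are illustrative names only.
import torch
import torch.nn as nn

def estimate_alpha(weight: torch.Tensor, k_frac: float = 0.5) -> float:
    """Crude Hill-type estimate of the power-law exponent of the
    eigenvalue spectrum of W^T W (the layer's empirical spectral density)."""
    W = weight.detach().flatten(1)             # treat conv kernels as matrices
    eigs = torch.linalg.svdvals(W) ** 2        # eigenvalues of W^T W
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(k_frac * len(eigs)))
    tail = eigs[:k]                            # k largest eigenvalues
    # Hill estimator: alpha = 1 + k / sum_i log(lambda_i / lambda_k)
    return 1.0 + k / torch.log(tail / tail[-1]).sum().clamp_min(1e-12).item()

def assign_layer_lrs(model: nn.Module, base_lr: float = 0.1):
    """Scale each layer's learning rate by its alpha relative to the mean:
    heavier-tailed layers (smaller alpha, already well self-regularized)
    get smaller learning rates, lighter-tailed layers get larger ones.
    Only Linear/Conv2d layers are grouped in this sketch."""
    layers = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    alphas = [estimate_alpha(m.weight) for m in layers]
    mean_alpha = sum(alphas) / len(alphas)
    return [
        {"params": m.parameters(), "lr": base_lr * a / mean_alpha}
        for m, a in zip(layers, alphas)
    ]

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(assign_layer_lrs(model), lr=0.1, momentum=0.9)
```

In an actual training loop, the per-layer metrics would be recomputed periodically (e.g., once per epoch) and the optimizer's parameter-group learning rates updated in place, alongside the usual global schedule.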
Related papers
- To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO [68.69840111477367]
We present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs.
Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models.
arXiv Detail & Related papers (2024-04-06T09:55:03Z)
- Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness [11.35810118757863]
Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers.
We present a fast learnable once-for-all adversarial training (FLOAT) algorithm which, instead of the existing FiLM-based conditioning, uses a weight-conditioned learning scheme that requires no additional layers.
In particular, we add scaled noise to the weight tensors, enabling a trade-off between clean and adversarial performance.
arXiv Detail & Related papers (2022-03-28T19:25:36Z)
- Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
arXiv Detail & Related papers (2021-12-06T18:59:52Z)
- LRTuner: A Learning Rate Tuner for Deep Neural Networks [10.913790890826785]
The choice of learning rate schedule determines the computational cost of getting close to a minimum, how close you actually get to it, and, most importantly, the kind of local minimum (wide or narrow) that is attained.
Current systems employ hand tuned learning rate schedules, which are painstakingly tuned for each network and dataset.
We present LRTuner, a method for tuning the learning rate schedule as training proceeds.
arXiv Detail & Related papers (2021-05-30T13:06:26Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning-rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
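As a rough, self-contained illustration of the direction-alignment idea described in the AdaRem entry above, the sketch below scales each parameter's step by the agreement between its accumulated past movement and the current update direction. It assumes PyTorch; the class name `AdaRemSketch` and the hyperparameters `beta` and `eta` are illustrative stand-ins, not the paper's actual update rule.

```python
# Illustrative parameter-wise learning-rate modulation in the spirit of AdaRem:
# the effective step grows when the current update direction agrees with the
# parameter's accumulated past movement, and shrinks when it opposes it.
import torch

class AdaRemSketch(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1, beta=0.9, eta=0.5):
        defaults = dict(lr=lr, beta=beta, eta=eta)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "dir" not in state:
                    state["dir"] = torch.zeros_like(p)
                d = state["dir"]  # exponential average of past parameter changes
                # Agreement is +1 where the current update direction (-grad)
                # matches the past movement, -1 where it opposes it, 0 otherwise.
                agree = torch.sign(d) * torch.sign(-p.grad)
                scale = 1.0 + group["eta"] * agree      # per-parameter LR factor
                p.add_(p.grad * scale, alpha=-group["lr"])
                # Track past movement as an average of negative gradients.
                d.mul_(group["beta"]).add_(p.grad, alpha=-(1 - group["beta"]))

# Example usage on a tiny model (illustrative):
model = torch.nn.Linear(16, 4)
opt = AdaRemSketch(model.parameters(), lr=0.05)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
opt.step()
```

The paper's actual method combines this alignment signal with momentum and a different scaling rule; the sketch only conveys the direction-agreement heuristic.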