Temperature Balancing, Layer-wise Weight Analysis, and Neural Network
Training
- URL: http://arxiv.org/abs/2312.00359v1
- Date: Fri, 1 Dec 2023 05:38:17 GMT
- Title: Temperature Balancing, Layer-wise Weight Analysis, and Neural Network
Training
- Authors: Yefan Zhou, Tianyu Pang, Keqin Liu, Charles H. Martin, Michael W.
Mahoney, Yaoqing Yang
- Abstract summary: This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regularization in modern machine learning is crucial, and it can take various
forms in algorithmic design: training set, model family, error function,
regularization terms, and optimizations. In particular, the learning rate,
which can be interpreted as a temperature-like parameter within the statistical
mechanics of learning, plays a crucial role in neural network training. Indeed,
many widely adopted training strategies basically just define the decay of the
learning rate over time. This process can be interpreted as decreasing a
temperature, using either a global learning rate (for the entire model) or a
learning rate that varies for each parameter. This paper proposes TempBalance,
a straightforward yet effective layer-wise learning rate method. TempBalance is
based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which
characterizes the implicit self-regularization of different layers in trained
models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide
the scheduling and balancing of temperature across all network layers during
model training, resulting in improved performance during testing. We implement
TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using
ResNets, VGGs, and WideResNets with various depths and widths. Our results show
that TempBalance significantly outperforms ordinary SGD and carefully-tuned
spectral norm regularization. We also show that TempBalance outperforms a
number of state-of-the-art optimizers and learning rate schedulers.
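As a rough, self-contained illustration of the layer-wise temperature-balancing idea in the abstract, the sketch below estimates a heavy-tail exponent alpha for each layer's weight spectrum (a Hill-type estimate over the eigenvalues of W^T W) and scales per-layer learning rates by alpha relative to the network mean. It assumes PyTorch; the helper names `estimate_alpha` and `assign_layer_lrs`, the Hill estimator, and the linear alpha-to-learning-rate mapping are illustrative choices, not the authors' released implementation of TempBalance.

```python
# Minimal sketch of HT-SR-style layer-wise learning-rate balancing.
# `estimate_alpha` and `assign_layer_lrs` are illustrative names only.
import torch
import torch.nn as nn

def estimate_alpha(weight: torch.Tensor, k_frac: float = 0.5) -> float:
    """Crude Hill-type estimate of the power-law exponent of the
    eigenvalue spectrum of W^T W (the layer's empirical spectral density)."""
    W = weight.detach().flatten(1)             # treat conv kernels as matrices
    eigs = torch.linalg.svdvals(W) ** 2        # eigenvalues of W^T W
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(2, int(k_frac * len(eigs)))
    tail = eigs[:k]                            # k largest eigenvalues
    # Hill estimator: alpha = 1 + k / sum_i log(lambda_i / lambda_k)
    return 1.0 + k / torch.log(tail / tail[-1]).sum().clamp_min(1e-12).item()

def assign_layer_lrs(model: nn.Module, base_lr: float = 0.1):
    """Scale each layer's learning rate by its alpha relative to the mean:
    heavier-tailed layers (smaller alpha, already well self-regularized)
    get smaller learning rates, lighter-tailed layers get larger ones.
    Only Linear/Conv2d layers are grouped in this sketch."""
    layers = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    alphas = [estimate_alpha(m.weight) for m in layers]
    mean_alpha = sum(alphas) / len(alphas)
    return [
        {"params": m.parameters(), "lr": base_lr * a / mean_alpha}
        for m, a in zip(layers, alphas)
    ]

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(assign_layer_lrs(model), lr=0.1, momentum=0.9)
```

In an actual training loop, the per-layer metrics would be recomputed periodically (e.g., once per epoch) and the optimizer's parameter-group learning rates updated in place, alongside the usual global schedule.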
Related papers
- To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO [68.69840111477367]
We present a principled framework for learning a small yet generalizable temperature prediction network (TempNet) to improve LFMs.
Our experiments on LLMs and CLIP models demonstrate that TempNet greatly improves the performance of existing solutions or models.
arXiv Detail & Related papers (2024-04-06T09:55:03Z)
- Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- The Underlying Correlated Dynamics in Neural Training [6.385006149689549]
Training of neural networks is a computationally intensive task.
We propose a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality.
This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
arXiv Detail & Related papers (2022-12-18T08:34:11Z)
- A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness [11.35810118757863]
Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers.
We present a fast learnable once-for-all adversarial training (FLOAT) algorithm which, instead of the existing FiLM-based conditioning, uses a weight-conditioned learning scheme that requires no additional layers.
In particular, we add scaled noise to the weight tensors, enabling a trade-off between clean and adversarial performance.
arXiv Detail & Related papers (2022-03-28T19:25:36Z)
- Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
arXiv Detail & Related papers (2021-12-06T18:59:52Z)
- LRTuner: A Learning Rate Tuner for Deep Neural Networks [10.913790890826785]
The choice of learning rate schedule determines the computational cost of getting close to a minimum, how close you actually get to it, and, most importantly, the kind of local minimum (wide or narrow) that is attained.
Current systems employ hand tuned learning rate schedules, which are painstakingly tuned for each network and dataset.
We present LRTuner, a method for tuning the learning rate schedule as training proceeds.
arXiv Detail & Related papers (2021-05-30T13:06:26Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning-rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
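As a rough, self-contained illustration of the direction-alignment idea described in the AdaRem entry above, the sketch below scales each parameter's step by the agreement between its accumulated past movement and the current update direction. It assumes PyTorch; the class name `AdaRemSketch` and the hyperparameters `beta` and `eta` are illustrative stand-ins, not the paper's actual update rule.

```python
# Illustrative parameter-wise learning-rate modulation in the spirit of AdaRem:
# the effective step grows when the current update direction agrees with the
# parameter's accumulated past movement, and shrinks when it opposes it.
import torch

class AdaRemSketch(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1, beta=0.9, eta=0.5):
        defaults = dict(lr=lr, beta=beta, eta=eta)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "dir" not in state:
                    state["dir"] = torch.zeros_like(p)
                d = state["dir"]  # exponential average of past parameter changes
                # Agreement is +1 where the current update direction (-grad)
                # matches the past movement, -1 where it opposes it, 0 otherwise.
                agree = torch.sign(d) * torch.sign(-p.grad)
                scale = 1.0 + group["eta"] * agree      # per-parameter LR factor
                p.add_(p.grad * scale, alpha=-group["lr"])
                # Track past movement as an average of negative gradients.
                d.mul_(group["beta"]).add_(p.grad, alpha=-(1 - group["beta"]))

# Example usage on a tiny model (illustrative):
model = torch.nn.Linear(16, 4)
opt = AdaRemSketch(model.parameters(), lr=0.05)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
opt.step()
```

The paper's actual method combines this alignment signal with momentum and a different scaling rule; the sketch only conveys the direction-agreement heuristic.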