A Hessian-informed hyperparameter optimization for differential learning rate
- URL: http://arxiv.org/abs/2501.06954v2
- Date: Sun, 18 May 2025 15:46:19 GMT
- Title: A Hessian-informed hyperparameter optimization for differential learning rate
- Authors: Shiyun Xu, Zhiqi Bu, Yiliang Zhang, Ian Barnett
- Abstract summary: Hessian-informed differential learning rate (Hi-DLR) builds on differential learning rate, a technique that applies different learning rates to different model parameters. We show that Hi-DLR can improve convergence by dynamically determining the learning rates during training.
- Score: 10.43211367988483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Differential learning rate (DLR), a technique that applies different learning rates to different model parameters, has been widely used in deep learning and achieved empirical success via its various forms. For example, parameter-efficient fine-tuning (PEFT) applies zero learning rates to most parameters so as to significantly save the computational cost. At the core, DLR leverages the observation that different parameters can have different loss curvature, which is hard to characterize in general. We propose the Hessian-informed differential learning rate (Hi-DLR), an efficient approach that solves the hyperparameter optimization (HPO) of learning rates and captures the loss curvature for any model and optimizer adaptively. Given a proper grouping of parameters, we empirically demonstrate that Hi-DLR can improve the convergence by dynamically determining the learning rates during the training.
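The idea in the abstract, that parameters with different loss curvature benefit from different learning rates, can be illustrated with a toy example. The sketch below is not the Hi-DLR algorithm from the paper; it is a minimal pure-Python illustration that assumes a two-group quadratic loss whose curvatures are known in advance, and all names (`gradient_descent`, `loss`, `curvatures`) are hypothetical. It compares a single shared learning rate, a per-group rate set to the inverse curvature, and a PEFT-like setting in which one group has a zero learning rate.

```python
def gradient_descent(curvatures, learning_rates, start, steps):
    """Minimize L(x) = 0.5 * sum(h_i * x_i**2) with one learning rate per group."""
    x = list(start)
    for _ in range(steps):
        # Gradient of the quadratic loss with respect to each group.
        grads = [h * xi for h, xi in zip(curvatures, x)]
        # Each group is updated with its own learning rate.
        x = [xi - lr * g for xi, lr, g in zip(x, learning_rates, grads)]
    return x


def loss(curvatures, x):
    """Toy quadratic loss; the curvature of group i is curvatures[i]."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(curvatures, x))


# Assumed toy setup: one flat group (curvature 1) and one sharp group (curvature 100).
curvatures = [1.0, 100.0]
start = [1.0, 1.0]
steps = 20

# A single learning rate must stay small enough for the sharp group,
# so the flat group makes slow progress.
uniform = gradient_descent(curvatures, [0.01, 0.01], start, steps)

# Curvature-aware rates: the inverse of each group's curvature.
differential = gradient_descent(
    curvatures, [1.0 / h for h in curvatures], start, steps
)

# PEFT-like extreme: a zero learning rate freezes the first group.
frozen = gradient_descent(curvatures, [0.0, 0.01], start, steps)

print("uniform loss:     ", loss(curvatures, uniform))
print("differential loss:", loss(curvatures, differential))
print("frozen group value:", frozen[0])
```

On this toy problem the curvature-aware rates reach a lower loss than the shared rate within the same number of steps, and the frozen group keeps its initial value. In a real model the curvature is not known in advance, which is the difficulty the paper addresses.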
Related papers
- Advantageous Parameter Expansion Training Makes Better Large Language Models [50.82647159657912]
A subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance.
We propose Advantageous EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones.
APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.
arXiv Detail & Related papers (2025-05-30T06:06:23Z)
- Hyperparameter Optimisation with Practical Interpretability and Explanation Methods in Probabilistic Curriculum Learning [2.5352713493505785]
Probabilistic Curriculum Learning (PCL) is a curriculum learning strategy designed to improve RL performance by structuring the agent's learning process.
We provide an empirical analysis of hyperparameter interactions and their effects on the performance of a PCL algorithm within standard RL tasks.
arXiv Detail & Related papers (2025-04-09T08:41:27Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning [11.929813643723413]
This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm's performance to hyperparameter tuning.
The results suggest that several algorithmic performance improvements may, in fact, be a result of an increased reliance on hyperparameter tuning.
arXiv Detail & Related papers (2024-12-10T03:55:18Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections.
In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
arXiv Detail & Related papers (2024-05-30T17:26:02Z)
- Diffusion Tempering Improves Parameter Estimation with Probabilistic Integrators for Ordinary Differential Equations [34.500484733973536]
Ordinary differential equations (ODEs) are widely used to describe dynamical systems in science, but identifying parameters that explain experimental measurements is challenging.
We propose diffusion tempering, a novel regularization technique for probabilistic numerical methods which improves convergence of gradient-based parameter optimization in ODEs.
We demonstrate that our method is effective for dynamical systems of different complexity and show that it obtains reliable parameter estimates for a Hodgkin-Huxley model with a practically relevant number of parameters.
arXiv Detail & Related papers (2024-02-19T15:36:36Z)
- Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models [109.06052781040916]
We introduce a technique to enhance the inference efficiency of parameter-shared language models.
We also propose a simple pre-training technique that leads to fully or partially shared models.
Results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs.
arXiv Detail & Related papers (2023-10-19T15:13:58Z)
- Improving Hyperparameter Learning under Approximate Inference in Gaussian Process Models [18.134776677795077]
We focus on the interplay between variational inference (VI) and the learning target.
We design a hybrid training procedure to bring the best of both worlds: it leverages conjugate-computation VI for inference.
We empirically demonstrate the effectiveness of our proposal across a wide range of data sets.
arXiv Detail & Related papers (2023-06-07T07:15:08Z)
- No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models [132.90062129639705]
We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
arXiv Detail & Related papers (2022-02-06T00:22:28Z)
- Learning to Refit for Convex Learning Problems [11.464758257681197]
We propose a framework to learn to estimate optimized model parameters for different training sets using neural networks.
We rigorously characterize the power of neural networks to approximate convex problems.
arXiv Detail & Related papers (2021-11-24T15:28:50Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction of a parameter's past changes is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.