No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for
Training Large Transformer Models
- URL: http://arxiv.org/abs/2202.02664v1
- Date: Sun, 6 Feb 2022 00:22:28 GMT
- Title: No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for
Training Large Transformer Models
- Authors: Chen Liang, Haoming Jiang, Simiao Zuo, Pengcheng He, Xiaodong Liu,
Jianfeng Gao, Weizhu Chen, Tuo Zhao
- Abstract summary: We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
- Score: 132.90062129639705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown the existence of significant redundancy in large
Transformer models. One can prune the redundant parameters without
significantly sacrificing the generalization performance. However, we question
whether the redundant parameters could have contributed more if they were
properly trained. To answer this question, we propose a novel training strategy
that encourages all parameters to be trained sufficiently. Specifically, we
adaptively adjust the learning rate for each parameter according to its
sensitivity, a robust gradient-based measure reflecting this parameter's
contribution to the model performance. A parameter with low sensitivity is
redundant, and we improve its fitting by increasing its learning rate. In
contrast, a parameter with high sensitivity is well-trained, and we regularize
it by decreasing its learning rate to prevent further overfitting. We conduct
extensive experiments on natural language understanding, neural machine
translation, and image classification to demonstrate the effectiveness of the
proposed schedule. Analysis shows that the proposed schedule indeed reduces the
redundancy and improves generalization performance.
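The abstract's rule, increase the learning rate where sensitivity is low and decrease it where sensitivity is high, can be sketched in a few lines. The sketch below is illustrative only: it approximates a parameter's sensitivity as |theta * grad| smoothed by an exponential moving average, and the clamped inverse scaling rule is an assumption, not the paper's exact schedule.

```python
# Illustrative sketch of a sensitivity-guided per-parameter learning rate.
# Sensitivity is approximated as |theta * grad|, smoothed with an exponential
# moving average (EMA); low-sensitivity parameters get a larger step (train
# harder) and high-sensitivity parameters a smaller one (regularize).

def sensitivity_guided_step(params, grads, ema, base_lr=0.1, beta=0.9, eps=1e-8):
    """One SGD step with per-parameter rates scaled by sensitivity."""
    # Update the EMA of each parameter's sensitivity |theta_j * g_j|.
    for j, (p, g) in enumerate(zip(params, grads)):
        ema[j] = beta * ema[j] + (1.0 - beta) * abs(p * g)
    mean_s = sum(ema) / len(ema) + eps
    new_params = []
    for p, g, s in zip(params, grads, ema):
        # Below-average sensitivity -> scale > 1; above-average -> scale < 1.
        # Clamp to keep the step size bounded.
        scale = min(max(mean_s / (s + eps), 0.1), 10.0)
        new_params.append(p - base_lr * scale * g)
    return new_params, ema
```

With two parameters of equal gradient but very different magnitude, the low-sensitivity one receives the larger update, which is the behavior the abstract describes.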
Related papers
- A Hessian-informed hyperparameter optimization for differential learning rate [10.43211367988483]
Hessian-informed differential learning rate (Hi-DLR) is a technique that applies different learning rates to different model parameters.
Hi-DLR can improve convergence by dynamically determining learning rates during training.
It also exhibits comparable performance on various full model training tasks.
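The core idea of a differential learning rate, one rate per parameter group, can be sketched simply. In the sketch below, the grouping and the inverse-square-root-of-curvature rule (using squared gradients as a crude diagonal-Hessian proxy) are assumptions for illustration, not Hi-DLR's exact Hessian-informed procedure.

```python
# Illustrative sketch of a differential learning rate: each parameter group
# gets its own rate, set smaller where estimated curvature is larger.

def group_learning_rates(grad_groups, base_lr=0.1, eps=1e-8):
    """Return one learning rate per group, inversely scaled by a curvature proxy."""
    rates = []
    for grads in grad_groups:
        # Mean squared gradient as a rough stand-in for diagonal Hessian mass.
        curvature = sum(g * g for g in grads) / len(grads)
        rates.append(base_lr / (curvature ** 0.5 + eps))
    return rates
```

A flat group (small gradients) ends up with a larger rate than a sharp one, mirroring the intuition that different parts of a model tolerate different step sizes.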
arXiv Detail & Related papers (2025-01-12T22:21:06Z)
- Rethinking Model Redundancy for Low-light Image Enhancement [21.864075752556452]
Low-light image enhancement (LLIE) is a fundamental task in computational photography, aiming to improve illumination, reduce noise, and enhance the image quality of low-light images.
Recent advancements primarily focus on customizing complex neural network models, but we have observed significant redundancy in these models, limiting further performance improvement.
Building on this rethinking, we propose two techniques that mitigate model redundancy while improving LLIE performance.
arXiv Detail & Related papers (2024-12-21T03:17:28Z)
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
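The summary does not spell out how the "ineffective" parameters are identified. One plausible reading, sketched below as a hypothetical illustration rather than SaRA's actual criterion, is to select the fraction of parameters with the smallest absolute values as the trainable subset.

```python
# Hypothetical sketch: mark the smallest-magnitude parameters as the
# "ineffective" subset to be trained during fine-tuning. The selection rule
# is an assumption for illustration, not SaRA's published method.

def ineffective_mask(params, fraction=0.3):
    """Return a boolean mask marking the `fraction` of parameters
    with the smallest |value| as trainable (ties may select a few extra)."""
    k = int(len(params) * fraction)
    if k == 0:
        return [False] * len(params)
    threshold = sorted(abs(p) for p in params)[k - 1]
    return [abs(p) <= threshold for p in params]
```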
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers and four parameterizations.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
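The "dynamic adjustment to the intrinsic rank" can be pictured as a gate vector over LoRA ranks that is sparsified during training: a proximal soft-threshold step drives some gate entries exactly to zero, deactivating those ranks. The sketch below shows only this gating mechanic, under the assumption that an L1-style proximal update is used; it is not SoRA's full training procedure.

```python
# Illustrative sketch of rank gating: a soft-threshold (proximal step for an
# L1 penalty) shrinks gate entries toward zero; entries that reach exactly
# zero deactivate the corresponding low-rank component.

def soft_threshold(gate, lam):
    """Shrink each gate entry by lam toward zero; clip small entries to 0."""
    out = []
    for g in gate:
        if g > lam:
            out.append(g - lam)
        elif g < -lam:
            out.append(g + lam)
        else:
            out.append(0.0)
    return out

def effective_rank(gate):
    """Number of still-active (nonzero-gated) rank components."""
    return sum(1 for g in gate if g != 0.0)
```

Starting from a higher rank and pruning gates this way is consistent with the summary's description of initializing with a higher rank while taming the temporary parameter increase.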
arXiv Detail & Related papers (2023-11-20T11:56:25Z)
- The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains [13.579368172149135]
We argue that redundant parameters can be trained to make beneficial contributions.
We propose a general task-agnostic method, namely intra-distillation, appended to the regular training loss to balance sensitivity.
Our experiments show the strong effectiveness of our methods on machine translation, natural language understanding, and zero-shot cross-lingual transfer.
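An auxiliary loss that balances parameter contributions can be sketched as a divergence among the outputs of several forward passes, each run with a different subset of parameters disabled. The pairwise squared difference below is a stand-in for the paper's actual divergence term; the function name and form are assumptions for illustration.

```python
# Illustrative sketch of an intra-distillation-style auxiliary loss:
# penalize disagreement among the outputs of K forward passes, each computed
# with a different parameter subset disabled. Squared difference is used here
# as a simple stand-in for the paper's divergence measure.

def intra_distillation_loss(outputs):
    """Mean pairwise squared difference among K output vectors."""
    k = len(outputs)
    loss = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            loss += sum((a - b) ** 2 for a, b in zip(outputs[i], outputs[j]))
    return loss / (k * (k - 1) / 2)
```

Minimizing this term pushes the passes to agree, so no single parameter subset can dominate the prediction.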
arXiv Detail & Related papers (2022-05-23T16:01:46Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past update direction is aligned with the direction of its current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
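The alignment rule can be sketched directly: compare a running direction estimate against the current gradient, enlarge the rate when they agree and shrink it when they conflict. The +/-50% scaling factor and the sign-based alignment test below are illustrative assumptions, not AdaRem's exact update.

```python
# Illustrative sketch of direction-aligned per-parameter learning rates:
# a rate is scaled up when the running update direction agrees in sign with
# the current gradient, and scaled down when they conflict.

def adarem_step(params, grads, momentum, base_lr=0.1, beta=0.9):
    """One step with alignment-scaled per-parameter rates; updates momentum in place."""
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        prod = momentum[i] * g
        align = 1.0 if prod > 0 else -1.0 if prod < 0 else 0.0
        lr = base_lr * (1.0 + 0.5 * align)  # illustrative scaling rule
        momentum[i] = beta * momentum[i] + (1.0 - beta) * g
        new_params.append(p - lr * g)
    return new_params, momentum
```

A parameter whose history agrees with its gradient thus takes a larger step than one whose history oscillates against it.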
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.