No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for
Training Large Transformer Models
- URL: http://arxiv.org/abs/2202.02664v1
- Date: Sun, 6 Feb 2022 00:22:28 GMT
- Title: No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for
Training Large Transformer Models
- Authors: Chen Liang, Haoming Jiang, Simiao Zuo, Pengcheng He, Xiaodong Liu,
Jianfeng Gao, Weizhu Chen, Tuo Zhao
- Abstract summary: We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
- Score: 132.90062129639705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown the existence of significant redundancy in large
Transformer models. One can prune the redundant parameters without
significantly sacrificing the generalization performance. However, we question
whether the redundant parameters could have contributed more if they were
properly trained. To answer this question, we propose a novel training strategy
that encourages all parameters to be trained sufficiently. Specifically, we
adaptively adjust the learning rate for each parameter according to its
sensitivity, a robust gradient-based measure reflecting this parameter's
contribution to the model performance. A parameter with low sensitivity is
redundant, and we improve its fitting by increasing its learning rate. In
contrast, a parameter with high sensitivity is well-trained, and we regularize
it by decreasing its learning rate to prevent further overfitting. We conduct
extensive experiments on natural language understanding, neural machine
translation, and image classification to demonstrate the effectiveness of the
proposed schedule. Analysis shows that the proposed schedule indeed reduces the
redundancy and improves generalization performance.
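The scaling rule described in the abstract maps directly onto a per-parameter learning-rate multiplier. Below is a minimal sketch of that idea, assuming sensitivity is approximated by the common first-order proxy |θ · ∇θL| (the estimated loss change if the parameter were zeroed) smoothed with an exponential moving average; the class name SensitivityScaledSGD, the per-tensor normalization, and the [0.5, 1.5] scaling range are illustrative assumptions, not the exact schedule proposed in the paper.
```python
# Sketch: per-parameter learning-rate scaling guided by a sensitivity proxy.
# Low-sensitivity (redundant) parameters get a larger step, high-sensitivity
# (well-trained) parameters get a smaller step.
import torch


class SensitivityScaledSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1, beta=0.9, eps=1e-12):
        defaults = dict(lr=lr, beta=beta, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, eps = group["lr"], group["beta"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "sens" not in state:
                    state["sens"] = torch.zeros_like(p)
                # Sensitivity proxy: |theta * grad|, a first-order estimate of the
                # loss change if this parameter were set to zero.
                sens = (p * p.grad).abs()
                state["sens"].mul_(beta).add_(sens, alpha=1 - beta)
                # Normalize sensitivity to [0, 1] within the tensor (an assumption
                # made for this sketch), then scale the step: low sensitivity
                # -> up to 1.5x the base lr, high sensitivity -> down to 0.5x.
                s = state["sens"]
                s_norm = (s - s.min()) / (s.max() - s.min() + eps)
                scale = 1.5 - s_norm
                p.add_(p.grad * scale, alpha=-lr)
        return loss
```
In practice this would sit on top of a stronger base update rule (e.g. Adam-style moments) and a global learning-rate schedule; the sketch only isolates the low-sensitivity-up, high-sensitivity-down adjustment described above, used via the usual opt.zero_grad(), loss.backward(), opt.step() loop.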
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method that makes full use of the parameters found to be ineffective.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained across combinations of optimizers, parameterizations, learning rates, and model sizes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - Sine Activated Low-Rank Matrices for Parameter Efficient Learning [25.12262017296922]
We propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process.
Our method proves to be an enhancement for existing low-rank models, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), and Neural Radiance Fields (NeRF).
arXiv Detail & Related papers (2024-03-28T08:58:20Z) - Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach [17.678759882763078]
Fine-tuning for pre-trained Vision Transformers aims to adeptly tailor a model to downstream tasks.
Striking a balance between retaining the generalizable representation capacity of the pre-trained model and acquiring task-specific features is a key challenge.
We propose a Residual-based Low-Rank Rescaling (RLRR) fine-tuning strategy.
arXiv Detail & Related papers (2024-03-28T00:14:53Z) - Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z) - The Importance of Being Parameters: An Intra-Distillation Method for
Serious Gains [13.579368172149135]
We argue that redundant parameters can be trained to make beneficial contributions.
We propose a general task-agnostic method, namely intra-distillation, appended to the regular training loss to balance parameter sensitivity.
Our experiments show the strong effectiveness of our methods on machine translation, natural language understanding, and zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2022-05-23T16:01:46Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning-rate algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)