No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for
Training Large Transformer Models
- URL: http://arxiv.org/abs/2202.02664v1
- Date: Sun, 6 Feb 2022 00:22:28 GMT
- Title: No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for
Training Large Transformer Models
- Authors: Chen Liang, Haoming Jiang, Simiao Zuo, Pengcheng He, Xiaodong Liu,
Jianfeng Gao, Weizhu Chen, Tuo Zhao
- Abstract summary: We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
- Score: 132.90062129639705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown the existence of significant redundancy in large
Transformer models. One can prune the redundant parameters without
significantly sacrificing the generalization performance. However, we question
whether the redundant parameters could have contributed more if they were
properly trained. To answer this question, we propose a novel training strategy
that encourages all parameters to be trained sufficiently. Specifically, we
adaptively adjust the learning rate for each parameter according to its
sensitivity, a robust gradient-based measure reflecting this parameter's
contribution to the model performance. A parameter with low sensitivity is
redundant, and we improve its fitting by increasing its learning rate. In
contrast, a parameter with high sensitivity is well-trained, and we regularize
it by decreasing its learning rate to prevent further overfitting. We conduct
extensive experiments on natural language understanding, neural machine
translation, and image classification to demonstrate the effectiveness of the
proposed schedule. Analysis shows that the proposed schedule indeed reduces the
redundancy and improves generalization performance.
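The scaling rule described in the abstract maps directly onto a per-parameter learning-rate multiplier. Below is a minimal sketch of that idea, assuming sensitivity is approximated by the common first-order proxy |θ · ∇θL| (the estimated loss change if the parameter were zeroed) smoothed with an exponential moving average; the class name SensitivityScaledSGD, the per-tensor normalization, and the [0.5, 1.5] scaling range are illustrative assumptions, not the exact schedule proposed in the paper.
```python
# Sketch: per-parameter learning-rate scaling guided by a sensitivity proxy.
# Low-sensitivity (redundant) parameters get a larger step, high-sensitivity
# (well-trained) parameters get a smaller step.
import torch


class SensitivityScaledSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1, beta=0.9, eps=1e-12):
        defaults = dict(lr=lr, beta=beta, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, eps = group["lr"], group["beta"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "sens" not in state:
                    state["sens"] = torch.zeros_like(p)
                # Sensitivity proxy: |theta * grad|, a first-order estimate of the
                # loss change if this parameter were set to zero.
                sens = (p * p.grad).abs()
                state["sens"].mul_(beta).add_(sens, alpha=1 - beta)
                # Normalize sensitivity to [0, 1] within the tensor (an assumption
                # made for this sketch), then scale the step: low sensitivity
                # -> up to 1.5x the base lr, high sensitivity -> down to 0.5x.
                s = state["sens"]
                s_norm = (s - s.min()) / (s.max() - s.min() + eps)
                scale = 1.5 - s_norm
                p.add_(p.grad * scale, alpha=-lr)
        return loss
```
In practice this would sit on top of a stronger base update rule (e.g. Adam-style moments) and a global learning-rate schedule; the sketch only isolates the low-sensitivity-up, high-sensitivity-down adjustment described above, used via the usual opt.zero_grad(), loss.backward(), opt.step() loop.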
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method that makes full use of the parameters found to be ineffective.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained across combinations of optimizers, parameterizations, learning rates, and model sizes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - Sine Activated Low-Rank Matrices for Parameter Efficient Learning [25.12262017296922]
We propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process.
Our method proves to be an enhancement for existing low-rank models, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), and Neural Radiance Fields (NeRF).
arXiv Detail & Related papers (2024-03-28T08:58:20Z) - Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach [17.678759882763078]
Fine-tuning for pre-trained Vision Transformers aims to adeptly tailor a model to downstream tasks.
Striking a balance between retaining the generalizable representation capacity of the pre-trained model and acquiring task-specific features is a key challenge.
We propose a Residual-based Low-Rank Rescaling (RLRR) fine-tuning strategy.
arXiv Detail & Related papers (2024-03-28T00:14:53Z) - Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z) - The Importance of Being Parameters: An Intra-Distillation Method for
Serious Gains [13.579368172149135]
We argue that redundant parameters can be trained to make beneficial contributions.
We propose a general task-agnostic method, namely intra-distillation, appended to the regular training loss to balance parameter sensitivity.
Our experiments show the strong effectiveness of our methods on machine translation, natural language understanding, and zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2022-05-23T16:01:46Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning-rate algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)