The Importance of Being Parameters: An Intra-Distillation Method for
Serious Gains
- URL: http://arxiv.org/abs/2205.11416v1
- Date: Mon, 23 May 2022 16:01:46 GMT
- Title: The Importance of Being Parameters: An Intra-Distillation Method for
Serious Gains
- Authors: Haoran Xu, Philipp Koehn, Kenton Murray
- Abstract summary: We argue that redundant parameters can be trained to make beneficial contributions.
We propose a general task-agnostic method, namely intra-distillation, appended to the regular training loss to balance sensitivity.
Our experiments show the strong effectiveness of our methods on machine translation, natural language understanding, and zero-shot cross-lingual transfer.
- Score: 13.579368172149135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent model pruning methods have demonstrated the ability to remove
redundant parameters without sacrificing model performance. Common methods
remove redundant parameters according to the parameter sensitivity, a
gradient-based measure reflecting the contribution of the parameters. In this
paper, however, we argue that redundant parameters can be trained to make
beneficial contributions. We first highlight the large sensitivity
(contribution) gap between high-sensitivity and low-sensitivity parameters and
show that the model generalization performance can be significantly improved
after balancing the contribution of all parameters. Our goal is to balance the
sensitivity of all parameters and encourage all of them to contribute equally.
We propose a general task-agnostic method, namely intra-distillation, appended
to the regular training loss to balance parameter sensitivity. Moreover, we
also design a novel adaptive learning method to control the strength of
intra-distillation loss for faster convergence. Our experiments show the strong
effectiveness of our methods on machine translation, natural language
understanding, and zero-shot cross-lingual transfer across up to 48 languages,
e.g., a gain of 3.54 BLEU on average across 8 language pairs from the IWSLT'14
translation dataset.
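The abstract describes intra-distillation only at a high level. As a rough, non-authoritative sketch (PyTorch; the k-pass consistency loss, the saliency-style sensitivity measure, and all names are assumptions, not the paper's exact formulation), the following shows a gradient-based sensitivity estimate and a consistency-style term that could be appended to the task loss:

```python
import torch
import torch.nn.functional as F

def parameter_sensitivity(model, loss):
    # First-order saliency |theta * dL/dtheta|: a common gradient-based
    # sensitivity measure; the paper's exact definition may differ.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return [(p * g).abs() for p, g in zip(params, grads)]

def intra_distillation_loss(model, x, k=3):
    # Run k stochastic forward passes (dropout active) and penalize pairwise
    # disagreement between the resulting output distributions.
    log_probs = [F.log_softmax(model(x), dim=-1) for _ in range(k)]
    loss = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            # symmetric KL between pass i and pass j
            loss = loss + F.kl_div(log_probs[i], log_probs[j].exp(),
                                   reduction="batchmean")
            loss = loss + F.kl_div(log_probs[j], log_probs[i].exp(),
                                   reduction="batchmean")
    return loss / (k * (k - 1))

# Total objective: task loss plus a weighted intra-distillation term.
# alpha is fixed here; the paper adapts its strength during training.
# total = task_loss + alpha * intra_distillation_loss(model, inputs)
```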
Related papers
- LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method that effectively adapts large pre-trained models for downstream tasks.
We propose a novel approach that employs a low rank tensor parametrization for model updates.
Our method is both efficient and effective for fine-tuning large language models, achieving a substantial reduction in the number of parameters while maintaining comparable performance.
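The tensor parametrization itself is not specified in the summary; as a reference point, below is a minimal sketch of the per-matrix LoRA update that LoRTA builds on (PyTorch; the class name and initialization choices are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Baseline LoRA update W + (alpha/r) * B @ A. LoRTA replaces the
    per-matrix (A, B) pair with a shared low-rank tensor factorization,
    whose details are omitted in the summary above."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: delta starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```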
arXiv Detail & Related papers (2024-10-05T06:59:50Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of optimizers, parameterizations, and model sizes.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- Sine Activated Low-Rank Matrices for Parameter Efficient Learning [25.12262017296922]
We propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process.
Our method proves to be an enhancement for existing low-rank models, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), and Neural Radiance Fields (NeRF).
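One plausible reading of "integrating a sinusoidal function within the low-rank decomposition" is sketched below; the frequency and scaling choices are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class SineLowRank(nn.Module):
    """Low-rank factorization with an elementwise sine applied to the
    product A @ B; the nonlinearity raises the effective rank of the
    resulting weight without adding parameters."""
    def __init__(self, d_out: int, d_in: int, r: int = 4, freq: float = 200.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_out, r) / r ** 0.5)
        self.B = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5)
        self.freq = freq

    def weight(self):
        # sin() is applied elementwise to the rank-r product; dividing by
        # the frequency keeps the entries on a comparable scale.
        return torch.sin(self.freq * (self.A @ self.B)) / self.freq
```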
arXiv Detail & Related papers (2024-03-28T08:58:20Z)
- Parameter-Efficient Fine-Tuning without Introducing New Latency [7.631596468553607]
We introduce a novel adapter technique that directly applies the adapter to pre-trained parameters instead of the hidden representation.
Our proposed method attains a new state-of-the-art outcome in terms of both performance and storage efficiency, storing only 0.03% of the parameters of full fine-tuning.
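A minimal sketch of the general idea of adapting the weights rather than the hidden states, so the learned update can be merged for inference with no added latency; the module and its names are assumptions, not the paper's method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightSpaceAdapter(nn.Module):
    """Learns a low-rank update acting directly on the frozen weight;
    after merge(), inference runs through a plain linear layer."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        self.down = nn.Parameter(torch.zeros(base.out_features, r))
        self.up = nn.Parameter(torch.randn(r, base.in_features) * 0.01)

    def forward(self, x):
        w = self.base.weight + self.down @ self.up   # adapted weight
        return F.linear(x, w, self.base.bias)

    @torch.no_grad()
    def merge(self):
        # Fold the update into the base weight for deployment.
        self.base.weight += self.down @ self.up
```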
arXiv Detail & Related papers (2023-05-26T08:44:42Z)
- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [143.23123791557245]
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP.
We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score.
We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA.
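AdaLoRA's exact scoring and parametrization are not spelled out above; the sketch below loosely illustrates importance-driven budget reallocation, under the assumption that each module exposes per-rank magnitudes as an attribute E (the attribute name and the raw |theta * grad| score are assumptions, not AdaLoRA's actual smoothed score):

```python
import torch

def importance_score(param: torch.Tensor) -> torch.Tensor:
    # Sensitivity-style importance |theta * grad|; AdaLoRA additionally
    # smooths this over steps, which is omitted here for brevity.
    return (param * param.grad).abs()

def reallocate_budget(lora_modules, total_rank: int):
    # Score every rank component across all modules and keep only the
    # globally top `total_rank` components, zeroing out the rest.
    scores = torch.cat([importance_score(m.E).flatten() for m in lora_modules])
    k = min(total_rank, scores.numel())
    threshold = torch.topk(scores, k=k).values.min()
    for m in lora_modules:
        mask = (importance_score(m.E) >= threshold).float()
        m.E.data *= mask
```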
arXiv Detail & Related papers (2023-03-18T22:36:25Z)
- Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning [91.5113227694443]
We propose a novel Sensitivity-aware visual Parameter-efficient fine-Tuning (SPT) scheme.
SPT allocates trainable parameters to task-specific important positions.
Experiments on a wide range of downstream recognition tasks show that our SPT is complementary to the existing PEFT methods.
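A simplified reading of "allocating trainable parameters to important positions": rank weight entries by a sensitivity score and train only the top ones. All names and the |theta * grad| score are assumptions for illustration:

```python
import torch

def sensitivity_masks(model, loss, budget: int):
    # Select the `budget` most sensitive weight positions and return
    # per-parameter binary masks; everything else stays frozen.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    scores = torch.cat([(p * g).abs().flatten() for p, g in zip(params, grads)])
    threshold = torch.topk(scores, k=min(budget, scores.numel())).values.min()
    return [((p * g).abs() >= threshold).float() for p, g in zip(params, grads)]

# During training, gate each gradient with its mask before the optimizer step:
# for p, m in zip(params, masks): p.grad *= m
```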
arXiv Detail & Related papers (2023-03-15T12:34:24Z)
- Know Where You're Going: Meta-Learning for Parameter-Efficient Fine-tuning [34.66092282348687]
We show that taking the ultimate choice of fine-tuning method into consideration boosts the performance of parameter-efficient fine-tuning.
We prime the pretrained model specifically for parameter-efficient fine-tuning, resulting in gains of up to 1.7 points on cross-lingual NER fine-tuning.
arXiv Detail & Related papers (2022-05-25T02:51:57Z)
- No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models [132.90062129639705]
We propose a novel training strategy that encourages all parameters to be trained sufficiently.
A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate.
In contrast, a parameter with high sensitivity is well-trained and we regularize it by decreasing its learning rate to prevent further overfitting.
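A toy version of this idea, using an assumed inverse-sensitivity scaling and clamp rather than the paper's actual schedule:

```python
import torch

def sensitivity_scaled_step(params, base_lr: float, eps: float = 1e-12):
    # Scale each parameter's step inversely with its sensitivity
    # |theta * grad|: redundant (low-sensitivity) parameters get larger
    # effective learning rates, well-trained ones get smaller steps.
    with torch.no_grad():
        sens = [(p * p.grad).abs() for p in params]
        mean = torch.cat([s.flatten() for s in sens]).mean()
        for p, s in zip(params, sens):
            scale = (mean / (s + eps)).clamp(max=10.0)  # cap to avoid blow-ups
            p -= base_lr * scale * p.grad
```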
arXiv Detail & Related papers (2022-02-06T00:22:28Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
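A rough sketch of the alignment idea; the running-average bookkeeping and the damping coefficients are assumptions, not AdaRem's actual rule:

```python
import torch

class AdaRemLike:
    """Keeps a running average of past gradients per parameter and scales
    the step up when the current gradient agrees with that history,
    down when it opposes it."""
    def __init__(self, params, lr: float = 0.01, beta: float = 0.9):
        self.params, self.lr, self.beta = list(params), lr, beta
        self.avg = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, m in zip(self.params, self.avg):
            align = torch.sign(m) * torch.sign(p.grad)   # +1 aligned, -1 opposed
            scale = 1.0 + 0.5 * align                    # damp opposed directions
            p -= self.lr * scale * p.grad
            m.mul_(self.beta).add_(p.grad, alpha=1 - self.beta)
```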
arXiv Detail & Related papers (2020-10-21T14:49:00Z)