Composable Sparse Fine-Tuning for Cross-Lingual Transfer
- URL: http://arxiv.org/abs/2110.07560v1
- Date: Thu, 14 Oct 2021 17:27:29 GMT
- Title: Composable Sparse Fine-Tuning for Cross-Lingual Transfer
- Authors: Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, Ivan Vulić
- Abstract summary: Fine-tuning all parameters of a pre-trained model has become the mainstream approach for transfer learning.
We introduce a new fine-tuning method that is both modular (like adapters) and expressive (like sparse fine-tuning).
It outperforms adapters in zero-shot cross-lingual transfer by a large margin.
- Score: 56.86192078426372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning all parameters of a pre-trained model has become the mainstream
approach for transfer learning. To increase its efficiency and prevent
catastrophic forgetting and interference, techniques like adapters and sparse
fine-tuning have been developed. Adapters are modular, as they can be combined
to adapt a model towards different facets of knowledge (e.g., dedicated
language and/or task adapters). Sparse fine-tuning is expressive, as it
controls the behavior of all model components. In this work, we introduce a new
fine-tuning method with both these desirable properties. In particular, we
learn sparse, real-valued masks based on a simple variant of the Lottery Ticket
Hypothesis. Task-specific masks are obtained from annotated data in a source
language, and language-specific masks from masked language modeling in a target
language. Both these masks can then be composed with the pre-trained model.
Unlike adapter-based fine-tuning, this method neither increases the number of
parameters at inference time nor alters the original model architecture. Most
importantly, it outperforms adapters in zero-shot cross-lingual transfer by a
large margin in a series of multilingual benchmarks, including Universal
Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we
additionally find that sparsity is crucial to prevent both 1) interference
between the fine-tunings to be composed and 2) overfitting. We release the code
and models at https://github.com/cambridgeltl/composable-sft.
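As a concrete illustration of the method described above, here is a minimal PyTorch-style sketch. It assumes each fine-tuning is stored as a sparse difference over the pre-trained state dict; keeping only the largest parameter changes is just one simple way to realise a Lottery Ticket-style mask, and the function names are illustrative rather than the authors' released API (see the repository linked above for the actual implementation).

```python
import torch

def sparse_diff(finetuned_state, pretrained_state, density=0.05):
    """Keep only the largest `density` fraction of parameter changes; zero the rest."""
    diff = {name: finetuned_state[name] - pretrained_state[name]
            for name in pretrained_state
            if pretrained_state[name].is_floating_point()}
    all_changes = torch.cat([d.abs().flatten() for d in diff.values()])
    k = max(1, int(density * all_changes.numel()))
    threshold = torch.topk(all_changes, k).values.min()
    return {name: d * (d.abs() >= threshold) for name, d in diff.items()}

def compose(pretrained_state, *sparse_diffs):
    """Add sparse language/task difference vectors onto the pre-trained weights."""
    composed = {name: p.clone() for name, p in pretrained_state.items()}
    for diff in sparse_diffs:
        for name, d in diff.items():
            composed[name] += d
    return composed

# Hypothetical usage, following the abstract:
#   lang_sft = sparse_diff(mlm_tuned_state, base_state)   # MLM on the target language
#   task_sft = sparse_diff(task_tuned_state, base_state)  # annotated source-language data
#   model.load_state_dict(compose(base_state, lang_sft, task_sft))
```

Because the composed fine-tunings modify only a small fraction of the weights, no parameters are added at inference time and the architecture is unchanged; the abstract's analysis attributes the lack of interference between composed fine-tunings to exactly this sparsity.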
Related papers
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR [44.949146169903074]
The heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation.
Our proposed method brings 12.2% word error rate reduction on average and up to 37.5% on a single locale.
arXiv Detail & Related papers (2024-01-17T06:01:16Z) - Self-Evolution Learning for Discriminative Language Model Pretraining [103.57103957631067]
Self-Evolution learning (SE) is a simple and effective token masking and learning method.
SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach.
arXiv Detail & Related papers (2023-05-24T16:00:54Z) - Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning [16.60284838029852]
We investigate whether one could make a task-specific choice of which subset of layers to adapt.
We propose to select layers based on the variability of their hidden states given a task-specific corpus (see the illustrative sketch after this list).
arXiv Detail & Related papers (2022-10-18T17:58:43Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models on downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism, built on two key techniques, that improves adapter capacity without increasing parameters or computational cost.
arXiv Detail & Related papers (2022-05-24T23:41:22Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
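For the "Hidden State Variability" entry above, a rough sketch of the idea (rank layers by how much their hidden states vary on a task-specific corpus, then adapt only the top-ranked ones) might look as follows. This is not that paper's code: the checkpoint name, the within-sentence standard deviation used as a variability proxy, and the choice of adapting four layers are all assumptions made for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def layer_variability(sentences):
    """Average per-layer standard deviation of token representations,
    used here as a crude proxy for hidden-state variability."""
    per_sentence = []
    with torch.no_grad():
        for text in sentences:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**inputs, output_hidden_states=True).hidden_states
            # hidden is a tuple: (embedding output, layer 1, ..., layer N)
            per_sentence.append(torch.stack([h.std() for h in hidden[1:]]))
    return torch.stack(per_sentence).mean(dim=0)

corpus = ["A few task-specific sentences would go here.",
          "Layers whose representations vary most get adapted."]
scores = layer_variability(corpus)
layers_to_adapt = torch.topk(scores, k=4).indices.tolist()
print("Layers selected for adaptation:", layers_to_adapt)
```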