Gradient Vaccine: Investigating and Improving Multi-task Optimization in
Massively Multilingual Models
- URL: http://arxiv.org/abs/2010.05874v1
- Date: Mon, 12 Oct 2020 17:26:34 GMT
- Title: Gradient Vaccine: Investigating and Improving Multi-task Optimization in
Massively Multilingual Models
- Authors: Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao
- Abstract summary: This paper attempts to peek into the black-box of multilingual optimization through the lens of loss function geometry.
We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with language proximity.
We derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks.
- Score: 63.92643612630657
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massively multilingual models subsuming tens or even hundreds of languages
pose great challenges to multi-task optimization. While it is a common practice
to apply a language-agnostic procedure optimizing a joint multilingual task
objective, how to properly characterize and take advantage of its underlying
problem structure for improving optimization efficiency remains under-explored.
In this paper, we attempt to peek into the black-box of multilingual
optimization through the lens of loss function geometry. We find that gradient
similarity measured along the optimization trajectory is an important signal,
which correlates well with not only language proximity but also the overall
model performance. This observation helps us identify a critical limitation
of existing gradient-based multi-task learning methods, and thus we derive a
simple and scalable optimization procedure, named Gradient Vaccine, which
encourages more geometrically aligned parameter updates for close tasks.
Empirically, our method obtains significant model performance gains on
multilingual machine translation and XTREME benchmark tasks for multilingual
language models. Our work reveals the importance of properly measuring and
utilizing language proximity in multilingual optimization, and has broader
implications for multi-task learning beyond multilingual modeling.
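To make the idea above concrete, below is a minimal NumPy sketch of the kind of gradient-alignment step the abstract describes: per-task gradient cosine similarities are tracked with an exponential moving average, and whenever the similarity between two task gradients falls below that running target, one gradient is augmented with a scaled component of the other until the target is reached. The function names (align_gradient, gradient_vaccine_step), the EMA coefficient beta, and the pairwise update order are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def align_gradient(g_i, g_j, phi_target, eps=1e-12):
    """If the cosine similarity between task gradients g_i and g_j is below
    phi_target, add a scaled component of g_j to g_i so that the adjusted
    gradient meets that target (a sketch, not the paper's exact rule)."""
    norm_i, norm_j = np.linalg.norm(g_i), np.linalg.norm(g_j)
    phi = float(np.clip(g_i @ g_j / (norm_i * norm_j + eps), -1.0, 1.0))
    if phi >= phi_target:
        return g_i  # already aligned at least as well as the target: no change
    # Coefficient chosen so that cos(g_i + a * g_j, g_j) equals phi_target.
    a = norm_i * (phi_target * np.sqrt(1.0 - phi ** 2)
                  - phi * np.sqrt(1.0 - phi_target ** 2)) \
        / (norm_j * np.sqrt(1.0 - phi_target ** 2) + eps)
    return g_i + a * g_j


def gradient_vaccine_step(task_grads, ema_targets, beta=0.01):
    """One multi-task update over per-language gradients: track an exponential
    moving average of each pair's cosine similarity and pull gradient i toward
    gradient j whenever their current similarity falls below that average.
    Returns the averaged adjusted gradient and the updated EMA table."""
    n = len(task_grads)
    adjusted = [g.copy() for g in task_grads]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            g_i, g_j = task_grads[i], task_grads[j]
            phi = float(g_i @ g_j / (np.linalg.norm(g_i) * np.linalg.norm(g_j) + 1e-12))
            ema_targets[i, j] = (1.0 - beta) * ema_targets[i, j] + beta * phi
            # Simplification: adjust the running copy against each raw gradient in turn.
            adjusted[i] = align_gradient(adjusted[i], g_j, ema_targets[i, j])
    return np.mean(adjusted, axis=0), ema_targets


# Toy usage: three per-language gradients for a four-parameter model.
rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(3)]
ema = np.zeros((3, 3))  # neutral starting target of zero cosine similarity
update, ema = gradient_vaccine_step(grads, ema)
```

In a real trainer the same logic would run on per-layer gradients inside each optimization step. The adaptive EMA target, rather than a fixed threshold of zero, is the design choice that lets close tasks be pulled toward each other even when their gradients do not strictly conflict.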
Related papers
- No Train but Gain: Language Arithmetic for training-free Language Adapters enhancement [59.37775534633868]
We introduce a novel method called language arithmetic, which enables training-free post-processing.
The effectiveness of the proposed solution is demonstrated on three downstream tasks in a MAD-X-based set of cross-lingual schemes.
arXiv Detail & Related papers (2024-04-24T08:52:40Z)
- CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment [38.35458193262633]
English-centric models usually perform suboptimally in languages other than English.
We propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data.
arXiv Detail & Related papers (2024-04-18T06:20:50Z)
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs [5.682384717239095]
Large language models (LLMs) are at the forefront of transforming numerous domains globally.
This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs.
We present novel techniques that unlock the true potential of LLMs in a polyglot landscape.
arXiv Detail & Related papers (2023-05-28T14:48:38Z)
- Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models [12.759281077118567]
Massively multilingual Transformer-based language models have been observed to be surprisingly effective at zero-shot transfer across languages.
We build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem.
arXiv Detail & Related papers (2022-05-12T14:47:03Z)
- Visualizing the Relationship Between Encoded Linguistic Information and Task Performance [53.223789395577796]
We study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality.
We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances.
Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance.
arXiv Detail & Related papers (2022-03-29T19:03:10Z)
- Sequential Reptile: Inter-Task Gradient Alignment for Multilingual Learning [61.29879000628815]
We show that aligning gradients across tasks is crucial for maximizing knowledge transfer.
We propose a simple yet effective method that can efficiently align gradients between tasks.
We extensively validate our method on various multi-task learning and zero-shot cross-lingual transfer tasks.
arXiv Detail & Related papers (2021-10-06T09:10:10Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Demystify Optimization Challenges in Multilingual Transformers [21.245418118851884]
We study optimization challenges from loss landscape and parameter plasticity perspectives.
We find that imbalanced training data causes task interference between high- and low-resource languages.
We propose Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages.
arXiv Detail & Related papers (2021-04-15T17:51:03Z)