Demystify Optimization Challenges in Multilingual Transformers
- URL: http://arxiv.org/abs/2104.07639v1
- Date: Thu, 15 Apr 2021 17:51:03 GMT
- Title: Demystify Optimization Challenges in Multilingual Transformers
- Authors: Xian Li, Hongyu Gong
- Abstract summary: We study optimization challenges from the loss-landscape and parameter-plasticity perspectives.
We find that imbalanced training data causes task interference between high- and low-resource languages.
We propose Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages.
- Score: 21.245418118851884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual Transformers improve parameter efficiency and crosslingual
transfer, but how to effectively train multilingual models has not been well
studied. Using multilingual machine translation as a testbed, we study
optimization challenges from the loss-landscape and parameter-plasticity
perspectives. We find that imbalanced training data causes task interference
between high- and low-resource languages, characterized by nearly orthogonal
gradients for major parameters and an optimization trajectory that is mostly
dominated by the high-resource languages. We show that the local curvature of
the loss surface affects the degree of interference, and that existing
data-subsampling heuristics implicitly reduce this sharpness, although they
still face a trade-off between high- and low-resource languages. We propose a
principled multi-objective optimization algorithm, Curvature Aware Task Scaling
(CATS), which improves both optimization and generalization, especially for
low-resource languages. Experiments on the TED, WMT and OPUS-100 benchmarks
demonstrate that CATS advances the Pareto front of accuracy while remaining
efficient to apply in massively multilingual settings at the scale of 100 languages.
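To make the diagnostics in the abstract concrete, the sketch below measures (a) per-language gradient interference as the cosine similarity between flattened gradients, which is near zero when gradients are "nearly orthogonal", and (b) the temperature-based subsampling weights that the abstract groups under existing data-subsampling heuristics. This is a minimal illustration assuming a PyTorch model with separate per-language losses; it is not the paper's CATS algorithm, and all function and variable names are hypothetical.

```python
# Minimal illustration (not the paper's CATS algorithm): measure per-language
# gradient interference and compute temperature-based subsampling weights.
import torch
import torch.nn.functional as F


def flat_grad(loss, params):
    """Concatenate d(loss)/d(params) into a single 1-D vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def gradient_interference(model, loss_high, loss_low):
    """Cosine similarity between a high-resource and a low-resource language's
    gradients; values near 0 correspond to the 'nearly orthogonal' gradients
    described in the abstract."""
    params = [p for p in model.parameters() if p.requires_grad]
    return F.cosine_similarity(
        flat_grad(loss_high, params), flat_grad(loss_low, params), dim=0
    ).item()


def temperature_sampling_weights(corpus_sizes, T=5.0):
    """Common subsampling heuristic: sample language i with probability
    proportional to |D_i| ** (1 / T), flattening the imbalance as T grows."""
    sizes = torch.tensor(corpus_sizes, dtype=torch.float)
    weights = sizes ** (1.0 / T)
    return weights / weights.sum()
```

For example, with corpora of 10M and 50k sentence pairs, proportional sampling gives the low-resource language about 0.5% of the updates, while T=5 raises its share to roughly 26%; the abstract's point is that such heuristics also implicitly reduce the sharpness of the loss surface but still trade off high- against low-resource quality.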
Related papers
- X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale [25.257770733168012]
Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English.
In this paper, we prioritize quality over scaling the number of languages, focusing on the multilingual machine translation task.
X-ALMA is a model designed with a commitment to ensuring top-tier performance across 50 diverse languages, regardless of their resource levels.
arXiv Detail & Related papers (2024-10-04T03:17:27Z)
- On the Pareto Front of Multilingual Neural Machine Translation [123.94355117635293]
We study how the performance of a given translation direction changes with its sampling ratio in multilingual neural machine translation (MNMT).
We propose the Double Power Law to predict the unique performance trade-off front in MNMT.
In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget.
arXiv Detail & Related papers (2023-04-06T16:49:19Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval [66.69799641522133]
State-of-the-art neural (re)rankers are notoriously data hungry.
Current approaches typically transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders.
We show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer.
arXiv Detail & Related papers (2022-04-05T15:44:27Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularizing NMT models at both the representation level and the gradient level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization (a generic form of such an objective is sketched after this list).
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models [63.92643612630657]
This paper attempts to peek into the black-box of multilingual optimization through the lens of loss function geometry.
We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with language proximity.
We derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks (an illustrative gradient-projection step in this spirit appears after this list).
arXiv Detail & Related papers (2020-10-12T17:26:34Z)
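To unpack the phrase "distributionally robust optimization" in the Distributionally Robust Multilingual Machine Translation entry above, a generic form of such an objective over N language pairs is shown below. This is a textbook min-max formulation given only for orientation; the paper's actual uncertainty set and its iterated best response solver are specified in that work.

```latex
\min_{\theta} \; \max_{\lambda \in \Lambda \subseteq \Delta_N} \; \sum_{i=1}^{N} \lambda_i \, \mathcal{L}_i(\theta)
```

Here \mathcal{L}_i(\theta) is the translation loss of language pair i, \Delta_N is the probability simplex, and \Lambda restricts how far the worst-case weighting \lambda may move from the empirical data distribution; fixing \Lambda to a single point recovers ordinary weighted (e.g. temperature-based) training.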
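The Gradient Vaccine entry turns the same gradient-geometry observation into a training procedure. As a point of reference, the sketch below shows a simpler, widely used projection step (PCGrad-style gradient surgery) that removes the conflicting component of one task's gradient. The actual Gradient Vaccine update is defined in that paper and differs from this projection, so treat the sketch purely as an illustration of what "more geometrically aligned parameter updates" can look like; the variable names are hypothetical.

```python
# Illustration only: PCGrad-style gradient surgery, a simpler relative of the
# Gradient Vaccine update described in the entry above.
import torch


def project_out_conflict(g_task, g_other, eps=1e-12):
    """If two task gradients conflict (negative inner product), remove from
    g_task its component along g_other so the shared update no longer points
    against the other task."""
    dot = torch.dot(g_task, g_other)
    if dot.item() < 0.0:
        g_task = g_task - (dot / (g_other.pow(2).sum() + eps)) * g_other
    return g_task


# Toy example: the low-resource gradient initially conflicts with the
# high-resource gradient; after projection it no longer does.
g_high = torch.tensor([1.0, 0.0])
g_low = torch.tensor([-0.5, 1.0])
g_low_aligned = project_out_conflict(g_low, g_high)  # tensor([0., 1.])
```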
This list is automatically generated from the titles and abstracts of the papers on this site.