Demystify Optimization Challenges in Multilingual Transformers
- URL: http://arxiv.org/abs/2104.07639v1
- Date: Thu, 15 Apr 2021 17:51:03 GMT
- Title: Demystify Optimization Challenges in Multilingual Transformers
- Authors: Xian Li, Hongyu Gong
- Abstract summary: We study optimization challenges from the loss-landscape and parameter-plasticity perspectives.
We find that imbalanced training data causes task interference between high- and low-resource languages.
We propose Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages.
- Score: 21.245418118851884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual Transformers improve parameter efficiency and crosslingual
transfer, but how to effectively train multilingual models has not been well
studied. Using multilingual machine translation as a testbed, we study
optimization challenges from the loss-landscape and parameter-plasticity
perspectives. We find that imbalanced training data causes task interference
between high- and low-resource languages, characterized by nearly orthogonal
gradients for major parameters and an optimization trajectory that is mostly
dominated by the high-resource languages. We show that the local curvature of
the loss surface affects the degree of interference, and that existing
data-subsampling heuristics implicitly reduce this sharpness, although they
still face a trade-off between high- and low-resource languages. We propose a
principled multi-objective optimization algorithm, Curvature Aware Task Scaling
(CATS), which improves both optimization and generalization, especially for
low-resource languages. Experiments on the TED, WMT and OPUS-100 benchmarks
demonstrate that CATS advances the Pareto front of accuracy while remaining
efficient to apply in massively multilingual settings at the scale of 100 languages.
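To make the diagnostics in the abstract concrete, the sketch below measures (a) per-language gradient interference as the cosine similarity between flattened gradients, which is near zero when gradients are "nearly orthogonal", and (b) the temperature-based subsampling weights that the abstract groups under existing data-subsampling heuristics. This is a minimal illustration assuming a PyTorch model with separate per-language losses; it is not the paper's CATS algorithm, and all function and variable names are hypothetical.

```python
# Minimal illustration (not the paper's CATS algorithm): measure per-language
# gradient interference and compute temperature-based subsampling weights.
import torch
import torch.nn.functional as F


def flat_grad(loss, params):
    """Concatenate d(loss)/d(params) into a single 1-D vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def gradient_interference(model, loss_high, loss_low):
    """Cosine similarity between a high-resource and a low-resource language's
    gradients; values near 0 correspond to the 'nearly orthogonal' gradients
    described in the abstract."""
    params = [p for p in model.parameters() if p.requires_grad]
    return F.cosine_similarity(
        flat_grad(loss_high, params), flat_grad(loss_low, params), dim=0
    ).item()


def temperature_sampling_weights(corpus_sizes, T=5.0):
    """Common subsampling heuristic: sample language i with probability
    proportional to |D_i| ** (1 / T), flattening the imbalance as T grows."""
    sizes = torch.tensor(corpus_sizes, dtype=torch.float)
    weights = sizes ** (1.0 / T)
    return weights / weights.sum()
```

For example, with corpora of 10M and 50k sentence pairs, proportional sampling gives the low-resource language about 0.5% of the updates, while T=5 raises its share to roughly 26%; the abstract's point is that such heuristics also implicitly reduce the sharpness of the loss surface but still trade off high- against low-resource quality.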
Related papers
- X-ALMA: Plug & Play Modules and Adaptive Rejection for Quality Translation at Scale [25.257770733168012]
Large language models (LLMs) have achieved remarkable success across various NLP tasks, yet their focus has predominantly been on English.
In this paper, we prioritize quality over scaling the number of languages, focusing on the multilingual machine translation task.
X-ALMA is a model designed with a commitment to ensuring top-tier performance across 50 diverse languages, regardless of their resource levels.
arXiv Detail & Related papers (2024-10-04T03:17:27Z)
- On the Pareto Front of Multilingual Neural Machine Translation [123.94355117635293]
We study how the performance of a given translation direction changes with its sampling ratio in multilingual neural machine translation (MNMT).
We propose the Double Power Law to predict the unique performance trade-off front in MNMT.
In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget.
arXiv Detail & Related papers (2023-04-06T16:49:19Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval [66.69799641522133]
State-of-the-art neural (re)rankers are notoriously data hungry.
Current approaches typically transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders.
We show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer.
arXiv Detail & Related papers (2022-04-05T15:44:27Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularizing NMT models at both the representation level and the gradient level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization (a generic form of such an objective is sketched after this list).
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models [63.92643612630657]
This paper attempts to peek into the black-box of multilingual optimization through the lens of loss function geometry.
We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with language proximity.
We derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks (an illustrative gradient-projection step in this spirit appears after this list).
arXiv Detail & Related papers (2020-10-12T17:26:34Z)
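To unpack the phrase "distributionally robust optimization" in the Distributionally Robust Multilingual Machine Translation entry above, a generic form of such an objective over N language pairs is shown below. This is a textbook min-max formulation given only for orientation; the paper's actual uncertainty set and its iterated best response solver are specified in that work.

```latex
\min_{\theta} \; \max_{\lambda \in \Lambda \subseteq \Delta_N} \; \sum_{i=1}^{N} \lambda_i \, \mathcal{L}_i(\theta)
```

Here \mathcal{L}_i(\theta) is the translation loss of language pair i, \Delta_N is the probability simplex, and \Lambda restricts how far the worst-case weighting \lambda may move from the empirical data distribution; fixing \Lambda to a single point recovers ordinary weighted (e.g. temperature-based) training.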
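The Gradient Vaccine entry turns the same gradient-geometry observation into a training procedure. As a point of reference, the sketch below shows a simpler, widely used projection step (PCGrad-style gradient surgery) that removes the conflicting component of one task's gradient. The actual Gradient Vaccine update is defined in that paper and differs from this projection, so treat the sketch purely as an illustration of what "more geometrically aligned parameter updates" can look like; the variable names are hypothetical.

```python
# Illustration only: PCGrad-style gradient surgery, a simpler relative of the
# Gradient Vaccine update described in the entry above.
import torch


def project_out_conflict(g_task, g_other, eps=1e-12):
    """If two task gradients conflict (negative inner product), remove from
    g_task its component along g_other so the shared update no longer points
    against the other task."""
    dot = torch.dot(g_task, g_other)
    if dot.item() < 0.0:
        g_task = g_task - (dot / (g_other.pow(2).sum() + eps)) * g_other
    return g_task


# Toy example: the low-resource gradient initially conflicts with the
# high-resource gradient; after projection it no longer does.
g_high = torch.tensor([1.0, 0.0])
g_low = torch.tensor([-0.5, 1.0])
g_low_aligned = project_out_conflict(g_low, g_high)  # tensor([0., 1.])
```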
This list is automatically generated from the titles and abstracts of the papers on this site.