Condensing Multilingual Knowledge with Lightweight Language-Specific
Modules
- URL: http://arxiv.org/abs/2305.13993v3
- Date: Sun, 22 Oct 2023 17:52:19 GMT
- Title: Condensing Multilingual Knowledge with Lightweight Language-Specific
Modules
- Authors: Haoran Xu, Weiting Tan, Shuyue Stella Li, Yunmo Chen, Benjamin Van
Durme, Philipp Koehn, Kenton Murray
- Abstract summary: We introduce the Language-Specific Matrix Synthesis (LMS) method.
This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices.
We condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique.
- Score: 52.973832863842546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incorporating language-specific (LS) modules is a proven method to boost
performance in multilingual machine translation. This approach bears similarity
to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the
scalability of this approach to hundreds of languages (experts) tends to be
unmanageable due to the prohibitive number of parameters introduced by
full-rank matrices in fully-connected layers. In this work, we introduce the
Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS
modules by generating low-rank matrices from two significantly smaller matrices
to approximate the full-rank matrix. Furthermore, we condense multilingual
knowledge from multiple LS modules into a single shared module with the Fuse
Distillation (FD) technique to improve the efficiency of inference and model
serialization. We show that our LMS method significantly outperforms previous
LS methods and MoE methods with the same amount of extra parameters, e.g., 1.73
BLEU points over the Switch Transformer on many-to-many multilingual machine
translation. Importantly, LMS achieves comparable translation performance with
far fewer parameters.
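The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of the two ideas as stated: a language-specific module synthesized as the product of two small low-rank matrices added to a shared layer (LMS), and a fuse-distillation-style objective that condenses the LS outputs into a single shared low-rank module (FD). All class names, shapes, initializations, and the exact loss form are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMSLinear(nn.Module):
    """Shared linear layer plus per-language low-rank LS modules (assumed form).

    Each language's full-rank update W_lang (d_out x d_in) is approximated by
    B_lang @ A_lang, with A_lang: (rank x d_in) and B_lang: (d_out x rank),
    rank << min(d_in, d_out).
    """

    def __init__(self, d_in, d_out, num_langs, rank=16):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)                      # shared full-rank weight
        self.A = nn.Parameter(torch.randn(num_langs, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_langs, d_out, rank))  # zero init: delta starts at 0
        # A single "fused" low-rank module used after distillation (assumption).
        self.A_fused = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B_fused = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x, lang_id=None):
        base = self.shared(x)
        if lang_id is None:                                       # fused/shared pathway
            delta = x @ self.A_fused.t() @ self.B_fused.t()
        else:                                                     # language-specific pathway
            delta = x @ self.A[lang_id].t() @ self.B[lang_id].t()
        return base + delta


def fuse_distillation_loss(layer, x, lang_id):
    """Hypothetical FD-style objective: the fused module mimics each LS module."""
    with torch.no_grad():
        teacher = layer(x, lang_id=lang_id)                       # LS output as the teacher
    student = layer(x, lang_id=None)                              # fused pathway as the student
    return F.mse_loss(student, teacher)
```

Under this reading, only the shared weight and the single fused low-rank pair need to be kept at inference and serialization time, which is where the efficiency gains described above would come from.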
Related papers
- A Parameter-efficient Language Extension Framework for Multilingual ASR [25.758826304861948]
We propose an architecture-based framework for language extension.
It is designed to be parameter-efficient, incrementally incorporating an add-on module to adapt to a new language.
Experiments are carried out on 5 new languages with a wide range of low-performing data sizes.
arXiv Detail & Related papers (2024-06-10T14:46:07Z)
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on the question translation data (i.e. without annotated answers) are able to encourage the alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
- LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction [24.675876324457747]
Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, may compromise the innate abilities of LLMs.
We propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information.
LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement.
arXiv Detail & Related papers (2024-04-01T04:39:21Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT), which restricts embedding entries to the language of interest to improve time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - across different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and yields up to a 25% improvement in generation speed; a rough sketch of the script-filtering heuristic follows this entry.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
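The entry above names Unicode-based script filtering as one of the two trimming heuristics; the snippet below is a small, self-contained sketch of that idea. The use of Unicode character names as a script proxy and the toy token-to-id vocabulary are assumptions for illustration, not the paper's implementation.

```python
import unicodedata

def keeps_script(token: str, script_keywords=("LATIN",)) -> bool:
    """Keep a token if every alphabetic character belongs to an allowed script.

    Script membership is approximated via the Unicode character name
    (e.g. 'LATIN SMALL LETTER A'); a real system would use proper script
    properties. Purely illustrative.
    """
    for ch in token:
        if not ch.isalpha():
            continue  # ignore punctuation, digits, byte markers, etc.
        try:
            name = unicodedata.name(ch)
        except ValueError:
            return False
        if not any(key in name for key in script_keywords):
            return False
    return True

def trim_vocab(vocab: dict[str, int], script_keywords=("LATIN",)) -> dict[str, int]:
    """Return the subset of a token->id vocabulary whose tokens match the script."""
    return {tok: idx for tok, idx in vocab.items()
            if keeps_script(tok, script_keywords)}

# Toy usage: trim a mixed vocabulary down to Latin-script tokens.
vocab = {"hello": 0, "world": 1, "你好": 2, "mundo": 3, "мир": 4}
print(trim_vocab(vocab))   # {'hello': 0, 'world': 1, 'mundo': 3}
```

In practice, the retained token ids would then be used to slice the embedding and output-projection matrices, which is where the memory reduction reported above would come from.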
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose MMA, a novel and affordable solution for effective vision-language (VL) adaptation of large language models (LLMs).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm that lets LLMs automatically shift between single- and multi-modal instructions (a simplified adapter-and-router sketch follows this entry).
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
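The MMA entry above sketches the architecture in only a sentence, so the following is a heavily simplified, assumed illustration of its two named ingredients: a lightweight adapter that projects image-encoder features into the LLM's hidden width, and a router that switches between text-only and multimodal inputs. All names, shapes, and the gating rule are hypothetical.

```python
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Bottleneck adapter projecting image-encoder features into the LLM width."""
    def __init__(self, d_image, d_model, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_image, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, img_feats):
        return self.up(torch.relu(self.down(img_feats)))

class ModalityRouter(nn.Module):
    """Toy router: prepend adapted image tokens only when an image is present."""
    def __init__(self, d_image, d_model):
        super().__init__()
        self.adapter = LightweightAdapter(d_image, d_model)

    def forward(self, text_embeds, img_feats=None):
        if img_feats is None:                 # single-modal (text-only) instruction
            return text_embeds
        img_tokens = self.adapter(img_feats)  # multi-modal instruction
        return torch.cat([img_tokens, text_embeds], dim=1)

# Toy shapes: batch of 2, 16 text tokens of width 512, 4 image patches of width 768.
router = ModalityRouter(d_image=768, d_model=512)
text = torch.randn(2, 16, 512)
image = torch.randn(2, 4, 768)
print(router(text).shape)         # torch.Size([2, 16, 512])
print(router(text, image).shape)  # torch.Size([2, 20, 512])
```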
- Checks and Strategies for Enabling Code-Switched Machine Translation [22.67264032644644]
Code-switching is a common phenomenon among multilingual speakers, where alternation between two or more languages occurs within the context of a single conversation.
This work explores multilingual neural machine translation (NMT) models' ability to handle code-switched text.
arXiv Detail & Related papers (2022-10-11T02:25:21Z)
- Examining Scaling and Transfer of Language Model Architectures for Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing.
In machine translation, EncDec has long been the favoured approach, but with few studies investigating the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z)
- Serial or Parallel? Plug-able Adapter for multilingual machine translation [15.114588783601466]
We propose PAM, a Transformer model augmented with defusion adaptation for multilingual machine translation.
PAM consists of embedding and layer adapters to shift the word and intermediate representations towards language-specific ones.
Experiment results on the IWSLT, OPUS-100, and WMT benchmarks show that the method outperforms several strong competitors; an illustrative adapter sketch follows this entry.
arXiv Detail & Related papers (2021-04-16T14:58:28Z)
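The PAM entry above mentions embedding and layer adapters without further detail; the sketch below shows one plausible reading of them as per-language residual bottleneck adapters, applied once after the embedding layer and once after each Transformer layer. Class names, the bottleneck form, and shapes are assumptions rather than PAM's actual design.

```python
import torch
import torch.nn as nn

class LanguageAdapter(nn.Module):
    """Per-language residual bottleneck adapter (assumed form, not PAM's exact one)."""
    def __init__(self, d_model, num_langs, bottleneck=64):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, bottleneck),
                          nn.ReLU(),
                          nn.Linear(bottleneck, d_model))
            for _ in range(num_langs)
        )

    def forward(self, hidden, lang_id):
        # Shift the shared representation towards the target language.
        return hidden + self.adapters[lang_id](hidden)

# Usage sketch: one adapter after the embedding layer, one after each layer's output.
d_model, num_langs = 512, 100
embed_adapter = LanguageAdapter(d_model, num_langs)
layer_adapter = LanguageAdapter(d_model, num_langs)

x = torch.randn(2, 16, d_model)          # (batch, length, d_model) embeddings
x = embed_adapter(x, lang_id=3)          # embedding adapter
# ... a shared Transformer layer would run here ...
x = layer_adapter(x, lang_id=3)          # layer adapter
print(x.shape)                           # torch.Size([2, 16, 512])
```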
- XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z)