Condensing Multilingual Knowledge with Lightweight Language-Specific
Modules
- URL: http://arxiv.org/abs/2305.13993v3
- Date: Sun, 22 Oct 2023 17:52:19 GMT
- Title: Condensing Multilingual Knowledge with Lightweight Language-Specific
Modules
- Authors: Haoran Xu, Weiting Tan, Shuyue Stella Li, Yunmo Chen, Benjamin Van
Durme, Philipp Koehn, Kenton Murray
- Abstract summary: We introduce the Language-Specific Matrix Synthesis (LMS) method.
This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices.
We condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique.
- Score: 52.973832863842546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incorporating language-specific (LS) modules is a proven method to boost
performance in multilingual machine translation. This approach bears similarity
to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the
scalability of this approach to hundreds of languages (experts) tends to be
unmanageable due to the prohibitive number of parameters introduced by
full-rank matrices in fully-connected layers. In this work, we introduce the
Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS
modules by generating low-rank matrices from two significantly smaller matrices
to approximate the full-rank matrix. Furthermore, we condense multilingual
knowledge from multiple LS modules into a single shared module with the Fuse
Distillation (FD) technique to improve the efficiency of inference and model
serialization. We show that our LMS method significantly outperforms previous
LS methods and MoE methods with the same number of extra parameters, e.g., by 1.73
BLEU points over the Switch Transformer on many-to-many multilingual machine
translation. Importantly, LMS achieves comparable translation performance with
far fewer parameters.
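The abstract describes two building blocks: language-specific modules whose weights are synthesized as the product of two much smaller matrices (hence low-rank), and a Fuse Distillation step that condenses the knowledge of the per-language modules into a single shared module that can be served alone. The sketch below is a minimal PyTorch illustration of both ideas; the class names, shapes, rank, zero-initialization of one factor, and the KL-based distillation loss are assumptions made for illustration, not the paper's released implementation.

```python
# Minimal sketch of the LMS / FD ideas (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankLSLinear(nn.Module):
    """A shared linear layer plus per-language low-rank deltas W_l = A_l @ B_l."""

    def __init__(self, d_in: int, d_out: int, num_langs: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)
        # Two small matrices per language synthesize a d_out x d_in LS matrix.
        self.A = nn.Parameter(torch.randn(num_langs, d_out, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_langs, rank, d_in))  # zero init: LS delta starts at 0

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        w_ls = self.A[lang_id] @ self.B[lang_id]   # (d_out, d_in), rank-limited
        return self.shared(x) + F.linear(x, w_ls)  # shared path + language-specific path


def fuse_distillation_loss(ls_logits: torch.Tensor, shared_logits: torch.Tensor) -> torch.Tensor:
    """Pull the shared ('fused') module toward the LS module's output distribution,
    so multilingual knowledge is condensed into one module for inference/serialization."""
    teacher = F.softmax(ls_logits.detach(), dim=-1)
    student = F.log_softmax(shared_logits, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")


if __name__ == "__main__":
    layer = LowRankLSLinear(d_in=512, d_out=512, num_langs=100, rank=8)
    x = torch.randn(4, 512)
    y_ls = layer(x, lang_id=3)      # routed through the delta of language 3
    y_shared = layer.shared(x)      # fused/shared path that would be kept at inference
    print(fuse_distillation_loss(y_ls, y_shared))
```

In this sketch, each language adds only 2 * 512 * 8 parameters per layer instead of a full 512 * 512 matrix, which is the scalability argument the abstract makes; after distillation, only the shared module would need to be loaded for serving.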
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression [5.206085750261924]
Large Language Models (LLMs) require a significant amount of memory for inference.
In this paper, we take a step further to explore parameter sharing across different layers with singular value decomposition (a generic sketch of this idea appears after the list below).
Comprehensive experiments demonstrate that Basis Sharing outperforms state-of-the-art SVD-based compression approaches.
arXiv Detail & Related papers (2024-10-02T14:30:02Z)
- Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs.
We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope.
We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z)
- Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners [67.85635044939836]
Large Language Models (LLMs) have shown impressive language capabilities.
In this work, we investigate the spontaneous multilingual alignment improvement of LLMs.
We find that LLMs instruction-tuned on question translation data (i.e., without annotated answers) encourage alignment between English and a wide range of languages.
arXiv Detail & Related papers (2024-05-22T16:46:19Z)
- LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction [24.675876324457747]
Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, may compromise the innate abilities of LLMs.
We propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information.
LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement.
arXiv Detail & Related papers (2024-04-01T04:39:21Z)
- Examining Scaling and Transfer of Language Model Architectures for Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing.
In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z)
- Serial or Parallel? Plug-able Adapter for multilingual machine translation [15.114588783601466]
We propose PAM, a Transformer model augmented with defusion adaptation for multilingual machine translation.
PAM consists of embedding and layer adapters to shift the word and intermediate representations towards language-specific ones.
Experimental results on the IWSLT, OPUS-100, and WMT benchmarks show that the method outperforms several strong competitors.
arXiv Detail & Related papers (2021-04-16T14:58:28Z)
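For the Basis Sharing entry above, the sketch below illustrates the general idea of cross-layer parameter sharing with a single SVD basis. The stacking of two weight matrices, the rank k, and the function name are illustrative assumptions; the paper's actual algorithm and basis selection may differ.

```python
# Generic illustration of cross-layer basis sharing via SVD (not the paper's exact method).
import torch


def share_basis(w1: torch.Tensor, w2: torch.Tensor, k: int):
    """Return a shared basis B (d_out x k) and per-layer coefficients C_i (k x d_in)
    such that each weight matrix is approximated by B @ C_i."""
    stacked = torch.cat([w1, w2], dim=1)                 # (d_out, 2 * d_in)
    U, _, _ = torch.linalg.svd(stacked, full_matrices=False)
    B = U[:, :k]                                         # shared left-singular basis
    return B, B.T @ w1, B.T @ w2                         # per-layer coefficients


if __name__ == "__main__":
    w1, w2 = torch.randn(512, 512), torch.randn(512, 512)
    B, c1, c2 = share_basis(w1, w2, k=128)
    # Storage: one 512x128 basis + two 128x512 coefficient blocks vs. two full 512x512 matrices.
    print((w1 - B @ c1).norm() / w1.norm())              # relative reconstruction error
```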