The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation
- URL: http://arxiv.org/abs/2406.16524v1
- Date: Mon, 24 Jun 2024 10:59:26 GMT
- Title: The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation
- Authors: Haryo Akbarianto Wibowo, Thamar Solorio, Alham Fikri Aji
- Abstract summary: Knowledge distillation (KD) has proven to be a successful strategy to improve the performance of a smaller model in many NLP tasks.
In this paper, we investigate the value of KD in multilingual settings.
- Score: 18.919374970049468
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge distillation (KD) has proven to be a successful strategy for improving the performance of smaller models in many NLP tasks. However, most work on KD explores only monolingual scenarios. In this paper, we investigate the value of KD in multilingual settings. We assess the significance of KD and of model initialization by analyzing how well the student model acquires multilingual knowledge from the teacher model. Our proposed method emphasizes copying the teacher model's weights directly into the student model to enhance initialization. Our findings show that, across various multilingual settings, initializing the student by copying weights from the fine-tuned teacher contributes more than the distillation process itself. Furthermore, we demonstrate that efficient weight initialization preserves multilingual capabilities even in low-resource scenarios.
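To make the "copy-weight" idea concrete, here is a minimal sketch of initializing a smaller student from a fine-tuned multilingual teacher, assuming a HuggingFace-style XLM-R classifier; the model path, the 6-layer student size, and the every-other-layer mapping are illustrative assumptions rather than the paper's exact recipe.

```python
# Illustrative sketch: initialize a smaller student by copying weights from a
# fine-tuned multilingual teacher. Model path, 6-layer student, and the
# every-other-layer mapping are assumptions, not the paper's exact setup.
from transformers import AutoConfig, AutoModelForSequenceClassification

teacher = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-xlm-roberta-base"  # hypothetical fine-tuned teacher
)

# Smaller student: same architecture, fewer transformer layers.
student_cfg = AutoConfig.from_pretrained(
    "xlm-roberta-base",
    num_hidden_layers=6,
    num_labels=teacher.config.num_labels,
)
student = AutoModelForSequenceClassification.from_config(student_cfg)

# Copy the (multilingual) embeddings and the task head directly.
student.roberta.embeddings.load_state_dict(teacher.roberta.embeddings.state_dict())
student.classifier.load_state_dict(teacher.classifier.state_dict())

# Copy every other teacher encoder layer into the student.
t_layers = teacher.roberta.encoder.layer
s_layers = student.roberta.encoder.layer
for s_idx, t_idx in enumerate(range(0, len(t_layers), 2)):
    s_layers[s_idx].load_state_dict(t_layers[t_idx].state_dict())
```

After this initialization the student can be trained with or without a distillation loss; the paper's finding is that the initialization itself accounts for most of the gain.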
Related papers
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
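For context, the vanilla KD objective that ATKD and several other methods in this list build on mixes a temperature-softened KL term with the usual task loss; the sketch below shows only that baseline (the temperature and mixing weight are arbitrary here), not ATKD's adaptive teaching.

```python
# Vanilla (Hinton-style) KD loss: temperature-softened KL to the teacher plus
# standard cross-entropy on gold labels. Baseline only; ATKD's adaptive
# teaching is not reproduced here. T and alpha are illustrative choices.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```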
arXiv Detail & Related papers (2024-02-19T07:01:10Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable across different model families with 120M to 13B parameters.
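The original MiniLLM paper replaces the forward KL of standard KD with a reverse KL objective; the fragment below sketches only a token-level reverse-KL term between student and teacher next-token distributions, omitting the paper's policy-gradient-style optimization and additional terms.

```python
# Token-level reverse KL, KL(student || teacher), as a simplified sketch of
# distilling an LLM's next-token distribution into a smaller model. MiniLLM's
# actual policy-gradient-style optimization and extra terms are omitted.
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits, mask):
    # logits: (batch, seq_len, vocab); mask: (batch, seq_len), 1 for real tokens
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    kl = (p_s * (log_p_s - log_p_t)).sum(dim=-1)  # KL(p_s || p_t) per token
    return (kl * mask).sum() / mask.sum()
```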
arXiv Detail & Related papers (2023-06-14T14:44:03Z) - DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model [16.31307448314024]
We propose DistilXLSR, a distilled cross-lingual speech representation model.
By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data.
Our method is shown to generalize to various languages and teacher models, and has the potential to improve the cross-lingual performance of English pre-trained models.
arXiv Detail & Related papers (2023-06-02T07:03:06Z) - Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model [15.839724725094916]
Multi-level Multilingual Knowledge Distillation (MMKD) is a novel method for improving multilingual language models.
We employ a teacher-student framework to transfer rich semantic representation knowledge from English BERT.
We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD.
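A generic multi-level distillation objective in this spirit combines logit distillation with alignment of intermediate representations; the sketch below shows one common generic form (softened KL on logits plus MSE on projected hidden states), not MMKD's exact losses, and the projection and layer choice are assumptions.

```python
# Generic multi-level distillation: softened KL on output logits plus MSE
# alignment of one pair of hidden states. A common generic form, not MMKD's
# exact objective; the projection layer and hidden-state choice are assumed.
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelKD(nn.Module):
    def __init__(self, student_dim, teacher_dim, T=2.0, w_hidden=1.0):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # match hidden sizes
        self.T, self.w_hidden = T, w_hidden

    def forward(self, s_logits, t_logits, s_hidden, t_hidden):
        logit_loss = F.kl_div(
            F.log_softmax(s_logits / self.T, dim=-1),
            F.softmax(t_logits / self.T, dim=-1),
            reduction="batchmean",
        ) * (self.T ** 2)
        hidden_loss = F.mse_loss(self.proj(s_hidden), t_hidden)
        return logit_loss + self.w_hidden * hidden_loss
```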
arXiv Detail & Related papers (2022-11-02T15:23:13Z) - UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation [102.04003089261761]
Multilingual neural machine translation (MNMT) enables one-pass translation using a shared semantic space for all languages.
We propose a novel method, named Unified Multilingual Multiple teacher-student Model for NMT (UM4).
Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for the zero-resource translation.
arXiv Detail & Related papers (2022-07-11T14:22:59Z) - Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation [20.33235443471006]
We propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student).
We also create a large-scale multilingual and code-mixed VQA dataset covering eleven different language setups.
Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.
arXiv Detail & Related papers (2021-09-10T03:47:29Z) - MergeDistill: Merging Pre-trained Language Models using Distillation [5.396915402673246]
We propose MergeDistill, a framework to merge pre-trained LMs in a way that can best leverage their assets with minimal dependencies.
We demonstrate the applicability of our framework in a practical setting by leveraging pre-existing teacher LMs and training student LMs that perform competitively with, or even outperform, teacher LMs trained on several orders of magnitude more data, at a fixed model capacity.
arXiv Detail & Related papers (2021-06-05T08:22:05Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to the teacher model throughout distillation.
Moreover, most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the varying complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
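As a point of contrast, a fixed-weight multi-teacher baseline simply distills a weighted mixture of the teachers' softened distributions; the sketch below implements only that baseline, not the paper's reinforcement-learning-based per-example teacher selection.

```python
# Fixed-weight multi-teacher distillation baseline: mix the teachers' softened
# distributions and distill the mixture into the student. The paper's RL-based
# per-example teacher selection is NOT implemented here; T is illustrative.
import torch.nn.functional as F

def multi_teacher_kd(student_logits, teacher_logits_list, weights, T=2.0):
    # weights: one scalar per teacher, summing to 1 (equal weights by default)
    mixture = sum(
        w * F.softmax(t_logits / T, dim=-1)
        for w, t_logits in zip(weights, teacher_logits_list)
    )
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        mixture,
        reduction="batchmean",
    ) * (T * T)
```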
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models into the unified multilingual model (student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.