AMTSS: An Adaptive Multi-Teacher Single-Student Knowledge Distillation
Framework For Multilingual Language Inference
- URL: http://arxiv.org/abs/2305.07928v1
- Date: Sat, 13 May 2023 14:42:30 GMT
- Authors: Qianglong Chen, Feng Ji, Feng-Lin Li, Guohai Xu, Ming Yan, Ji Zhang
and Yin Zhang
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge distillation is of key importance to launching multilingual
pre-trained language models for real applications. To support cost-effective
language inference in multilingual settings, we propose AMTSS, an adaptive
multi-teacher single-student distillation framework, which allows distilling
knowledge from multiple teachers to a single student. We first introduce an
adaptive learning strategy and teacher importance weight, which enables a
student to effectively learn from max-margin teachers and easily adapt to new
languages. Moreover, we present a shared student encoder with different
projection layers in support of multiple languages, which contributes to
largely reducing development and machine cost. Experimental results show that
AMTSS gains competitive results on the public XNLI dataset and the realistic
industrial dataset AliExpress (AE) in the E-commerce scenario.
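The abstract's two main ideas, a shared student encoder with per-language projection layers and teacher importance weights that favor max-margin (high-confidence) teachers, can be sketched at toy scale as follows. This is a hedged illustration, not the authors' implementation: the margin heuristic, the blending formula, and all names and dimensions here are assumptions.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def teacher_weight_by_margin(teacher_logits):
    """One importance weight per teacher, from its mean decision margin
    (top-1 minus top-2 logit) -- a plausible reading of learning from
    'max-margin teachers', not the paper's exact formula."""
    margins = []
    for batch in teacher_logits:          # batch: list of logit vectors
        ms = []
        for logits in batch:
            top2 = sorted(logits)[-2:]    # two largest logits, ascending
            ms.append(top2[1] - top2[0])
        margins.append(sum(ms) / len(ms))
    return softmax(margins)

def blended_soft_targets(per_teacher_logits, weights, temperature=2.0):
    """Weighted mix of temperature-scaled teacher distributions for one example."""
    num_classes = len(per_teacher_logits[0])
    blended = [0.0] * num_classes
    for w, logits in zip(weights, per_teacher_logits):
        p = softmax([l / temperature for l in logits])
        for i in range(num_classes):
            blended[i] += w * p[i]
    return blended

class SharedStudent:
    """One shared encoder plus one lightweight projection head per language."""
    def __init__(self, languages=("en", "es", "fr"), hidden=4, classes=3):
        random.seed(0)
        self.enc = [[random.gauss(0, 1) for _ in range(hidden)]
                    for _ in range(hidden)]
        self.heads = {lang: [[random.gauss(0, 1) for _ in range(classes)]
                             for _ in range(hidden)] for lang in languages}

    @staticmethod
    def _proj(v, m):
        # row vector v (len h) times matrix m (h x k) -> vector of length k
        return [sum(v[i] * m[i][j] for i in range(len(v)))
                for j in range(len(m[0]))]

    def forward(self, x, lang):
        h = [math.tanh(z) for z in self._proj(x, self.enc)]  # shared encoder
        return self._proj(h, self.heads[lang])               # per-language head

# toy demo: two teachers scoring a 4-example batch over 3 NLI labels
random.seed(1)
batch = [[[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
         for _ in range(2)]
weights = teacher_weight_by_margin(batch)
target = blended_soft_targets([batch[0][0], batch[1][0]], weights)
student = SharedStudent()
logits = student.forward([0.1, -0.2, 0.3, 0.0], "en")
```

In practice the student would be trained to match `target` (e.g. with a KL loss) per language, reusing the same encoder and swapping only the projection head, which is what keeps development and machine cost low.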
Related papers
- Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
We show that instruction-tuning on parallel instead of monolingual corpora benefits cross-lingual instruction-following capabilities by up to 4.6%.
We also conduct a human annotation study to understand the alignment between human-based and GPT-4-based evaluation within multilingual chat scenarios.
arXiv Detail & Related papers (2024-02-21T11:07:07Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- UM4: Unified Multilingual Multiple Teacher-Student Model for Zero-Resource Neural Machine Translation
Multilingual neural machine translation (MNMT) enables one-pass translation using a shared semantic space for all languages.
We propose a novel method named Unified Multilingual Multiple teacher-student Model for NMT (UM4).
Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for zero-resource translation.
arXiv Detail & Related papers (2022-07-11T14:22:59Z)
- Large-scale Bilingual Language-Image Contrastive Learning
We collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP.
We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation.
Experiments demonstrate that a model trained with these schemes shows competitive performance in both languages.
arXiv Detail & Related papers (2022-03-28T03:02:03Z)
- Breaking Down Multilingual Machine Translation
We show that multilingual training benefits encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation
We propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student).
We also create a large-scale multilingual and code-mixed VQA dataset covering eleven different language setups.
Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over pre-trained language-vision models on eleven diverse language setups.
arXiv Detail & Related papers (2021-09-10T03:47:29Z)
- MergeDistill: Merging Pre-trained Language Models using Distillation
We propose MergeDistill, a framework to merge pre-trained LMs in a way that best leverages their assets with minimal dependencies.
We demonstrate the applicability of our framework in a practical setting by leveraging pre-existing teacher LMs and training student LMs that perform competitively with, or even outperform, teacher LMs trained on several orders of magnitude more data and with a fixed model capacity.
arXiv Detail & Related papers (2021-06-05T08:22:05Z)
- One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z)
- LightMBERT: A Simple Yet Effective Method for Multilingual BERT Distillation
Multilingual pre-trained language models have shown impressive performance on cross-lingual natural language understanding tasks, but they are computationally intensive and difficult to deploy on resource-restricted devices.
We propose a simple yet effective distillation method (LightMBERT) for transferring the cross-lingual generalization ability of multilingual BERT to a small student model.
arXiv Detail & Related papers (2021-03-11T02:24:41Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language-branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
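Several of the entries above (MT-BERT, LightMBERT, LBMRC) rest on the same standard distillation objective: the student matches temperature-scaled teacher distributions alongside the hard labels. A minimal sketch of that loss for one example; the hyperparameters `alpha` and `temperature` are illustrative, not values taken from any of these papers.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Classic KD objective: alpha * soft KL term + (1 - alpha) * hard
    cross-entropy against the gold label."""
    s_log = log_softmax(student_logits, temperature)
    t_log = log_softmax(teacher_logits, temperature)
    t_prob = [math.exp(x) for x in t_log]
    # KL(teacher || student); the T^2 factor keeps gradient scale stable
    kl = sum(p * (lt - ls) for p, lt, ls in zip(t_prob, t_log, s_log))
    soft = (temperature ** 2) * kl
    hard = -log_softmax(student_logits)[label]
    return alpha * soft + (1 - alpha) * hard
```

With multiple teachers, the soft term is typically computed per teacher and then averaged or combined with importance weights, as in the multi-teacher frameworks listed here.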
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.