Distilling Efficient Language-Specific Models for Cross-Lingual Transfer
- URL: http://arxiv.org/abs/2306.01709v1
- Date: Fri, 2 Jun 2023 17:31:52 GMT
- Title: Distilling Efficient Language-Specific Models for Cross-Lingual Transfer
- Authors: Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, Ivan Vulić
- Abstract summary: Massively multilingual Transformers (MMTs) are widely used for cross-lingual transfer learning.
MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost.
We propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer.
- Score: 75.32131584449786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are
widely used for cross-lingual transfer learning. While these are pretrained to
represent hundreds of languages, end users of NLP systems are often interested
only in individual languages. For such purposes, the MMTs' language coverage
makes them unnecessarily expensive to deploy in terms of model size, inference
time, energy, and hardware cost. We thus propose to extract compressed,
language-specific models from MMTs which retain the capacity of the original
MMTs for cross-lingual transfer. This is achieved by distilling the MMT
bilingually, i.e., using data from only the source and target language of
interest. Specifically, we use a two-phase distillation approach, termed
BiStil: (i) the first phase distils a general bilingual model from the MMT,
while (ii) the second, task-specific phase sparsely fine-tunes the bilingual
"student" model using a task-tuned variant of the original MMT as its
"teacher". We evaluate this distillation technique in zero-shot cross-lingual
transfer across a number of standard cross-lingual benchmarks. The key results
indicate that the distilled models exhibit minimal degradation in target
language performance relative to the base MMT despite being significantly
smaller and faster. Furthermore, we find that they outperform multilingually
distilled models such as DistilmBERT and MiniLMv2 while having a very modest
training budget in comparison, even on a per-language basis. We also show that
bilingual models distilled from MMTs greatly outperform bilingual models
trained from scratch. Our code and models are available at
https://github.com/AlanAnsell/bistil.
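To make the two-phase recipe concrete, the following is a minimal, hedged sketch of BiStil-style bilingual distillation. It is not the authors' implementation (that is available at https://github.com/AlanAnsell/bistil); the temperature, loss weighting, batch key names, and the way sparsity is approximated are assumptions made purely for illustration, and the teacher/student objects are generic Hugging Face-style models supplied by the caller.
```python
# Hedged sketch of BiStil-style two-phase bilingual distillation.
# NOT the authors' code (see https://github.com/AlanAnsell/bistil); the
# temperature, loss weights, batch keys and the sparsity approximation are
# illustrative assumptions. `teacher`/`student` are Hugging Face-style models
# whose forward pass returns `.logits` (and `.loss` when labels are given),
# and whose output dimensions are assumed to match.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation loss: temperature-scaled KL(teacher || student)."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2


def general_distillation_step(teacher, student, batch, alpha=0.5):
    """Phase 1: distil the MMT into a smaller bilingual student on unlabelled
    source- and target-language text, mixing the student's own masked-LM loss
    with the soft-target loss."""
    with torch.no_grad():
        t_logits = teacher(input_ids=batch["input_ids"],
                           attention_mask=batch["attention_mask"]).logits
    s_out = student(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["mlm_labels"])
    return alpha * s_out.loss + (1 - alpha) * kd_loss(s_out.logits, t_logits)


def task_distillation_step(task_teacher, student, batch, trainable, alpha=0.5):
    """Phase 2: sparsely fine-tune the bilingual student (now carrying a task
    head) with a task-tuned copy of the original MMT as teacher. Sparsity is
    approximated here by freezing every parameter not named in `trainable`;
    BiStil's actual sparse fine-tuning masks are more involved."""
    for name, p in student.named_parameters():
        p.requires_grad = name in trainable
    with torch.no_grad():
        t_logits = task_teacher(input_ids=batch["input_ids"],
                                attention_mask=batch["attention_mask"]).logits
    s_out = student(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["task_labels"])
    return alpha * s_out.loss + (1 - alpha) * kd_loss(s_out.logits, t_logits)
```
The loss here follows the standard temperature-scaled soft-target formulation of knowledge distillation; the exact objectives, student initialisation, and sparse fine-tuning masks used by BiStil should be taken from the repository above.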
Related papers
- MT4CrossOIE: Multi-stage Tuning for Cross-lingual Open Information Extraction [38.88339164947934]
Cross-lingual open information extraction aims to extract structured information from raw text across multiple languages.
Previous work uses a shared cross-lingual pre-trained model to handle the different languages but underuses the potential of the language-specific representation.
We propose an effective multi-stage tuning framework called MT4CrossOIE, designed for enhancing cross-lingual open information extraction.
arXiv Detail & Related papers (2023-08-12T12:38:10Z)
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on the multilingual models rather than the Machine Translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z)
- Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model into the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
arXiv Detail & Related papers (2022-06-25T12:36:11Z)
- Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval [66.69799641522133]
State-of-the-art neural (re)rankers are notoriously data hungry.
Current approaches typically transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders.
We show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer.
arXiv Detail & Related papers (2022-04-05T15:44:27Z)
- Cross-Lingual Text Classification with Multilingual Distillation and Zero-Shot-Aware Training [21.934439663979663]
A multi-branch multilingual language model (MBLM) is built on multilingual pre-trained language models (MPLMs).
The method is based on transferring knowledge from high-performance monolingual models via a teacher-student framework.
Results on two cross-lingual classification tasks show that, with only the task's supervised data used, our method improves both the supervised and zero-shot performance of MPLMs.
arXiv Detail & Related papers (2022-02-28T09:51:32Z)
- Adapting Monolingual Models: Data can be Scarce when Language Similarity is High [3.249853429482705]
We investigate the performance of zero-shot transfer learning with as little data as possible.
We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties.
With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance.
arXiv Detail & Related papers (2021-05-06T17:43:40Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)