Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
- URL: http://arxiv.org/abs/2004.09813v2
- Date: Mon, 5 Oct 2020 06:30:56 GMT
- Title: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
- Authors: Nils Reimers, Iryna Gurevych
- Abstract summary: We present an easy and efficient method to extend existing sentence embedding models to new languages.
This makes it possible to create multilingual versions of previously monolingual models.
- Score: 73.65237422910738
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present an easy and efficient method to extend existing sentence embedding models to new languages. This makes it possible to create multilingual versions of previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: it is easy to extend existing models to new languages with relatively few samples, it is easier to ensure desired properties for the vector space, and the hardware requirements for training are lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embedding models to more than 400 languages is publicly available.
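As a hedged illustration of the training idea in the abstract (both a source sentence and its translation should land on the teacher's embedding of the source), the minimal PyTorch sketch below shows one distillation step. The ToyEncoder, the example sentence pair, and the hyperparameters are placeholders introduced here for illustration; this is not the authors' released implementation.

```python
# Minimal sketch (not the authors' released code) of the distillation
# objective described in the abstract: a frozen monolingual teacher gives
# the target embedding for a source sentence, and a multilingual student
# is trained so that both the source sentence and its translation are
# mapped to that same point in the vector space (mean-squared error).

import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a real sentence encoder (e.g. a Transformer).

    Uses bag-of-letters features only so the example runs end to end.
    """

    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(26, dim)

    def forward(self, sentences):
        feats = torch.zeros(len(sentences), 26)
        for i, sent in enumerate(sentences):
            for ch in sent.lower():
                if "a" <= ch <= "z":
                    feats[i, ord(ch) - ord("a")] += 1.0
        return self.proj(feats)


def distillation_step(teacher, student, optimizer, src_batch, tgt_batch):
    """One step on a batch of (source sentence, translation) pairs."""
    mse = nn.MSELoss()
    with torch.no_grad():              # the teacher stays frozen
        target = teacher(src_batch)    # embeddings of the source sentences

    # Both the source and the translated sentences should mimic the
    # teacher's embedding of the source sentence.
    loss = mse(student(src_batch), target) + mse(student(tgt_batch), target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    teacher, student = ToyEncoder(), ToyEncoder()
    optim = torch.optim.Adam(student.parameters(), lr=1e-3)
    src = ["The cat sits on the mat."]
    tgt = ["Die Katze sitzt auf der Matte."]
    print(distillation_step(teacher, student, optim, src, tgt))
```

Because only the student receives gradients, the teacher's vector space, and any properties it was trained to have, is preserved while the student learns to place translations at the same points.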
Related papers
- Extending the Subwording Model of Multilingual Pretrained Models for New Languages [31.702393348980735]
In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages.
In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages.
arXiv Detail & Related papers (2022-11-29T06:55:34Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models [3.6878069324996616]
We introduce a method -- called WECHSEL -- to transfer English models to new languages.
We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages.
arXiv Detail & Related papers (2021-12-13T12:26:02Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
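The recipe in the entry above amounts to freezing every original parameter and training only a newly added, language-specific embedding table on the new language's data, which is why performance on the initial languages is preserved. A minimal sketch of that idea, using a small stand-in Transformer encoder and a placeholder loss rather than the paper's actual NMT system:

```python
# Hedged sketch of the recipe above: keep the original model frozen and
# train only a new, language-specific embedding table on the new
# language's data. The model, vocabulary size, dimensions, and loss are
# hypothetical placeholders, not the paper's actual NMT system.

import torch
import torch.nn as nn

# Stand-in for the pretrained multilingual model.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)

for p in model.parameters():
    p.requires_grad = False            # original parameters stay untouched

# New embedding table for the new language's vocabulary.
new_lang_embeddings = nn.Embedding(num_embeddings=8000, embedding_dim=64)

# Only the new embeddings receive gradient updates.
optimizer = torch.optim.Adam(new_lang_embeddings.parameters(), lr=1e-4)

token_ids = torch.randint(0, 8000, (2, 10))     # fake batch of new-language tokens
hidden = model(new_lang_embeddings(token_ids))  # frozen encoder on top of new embeddings
loss = hidden.pow(2).mean()                     # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
optimizer.step()
```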
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)