WECHSEL: Effective initialization of subword embeddings for
cross-lingual transfer of monolingual language models
- URL: http://arxiv.org/abs/2112.06598v1
- Date: Mon, 13 Dec 2021 12:26:02 GMT
- Title: WECHSEL: Effective initialization of subword embeddings for
cross-lingual transfer of monolingual language models
- Authors: Benjamin Minixhofer, Fabian Paischer, Navid Rekabsaz
- Abstract summary: We introduce a method -- called WECHSEL -- to transfer English models to new languages.
We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages.
- Score: 3.6878069324996616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, large pretrained language models (LMs) have gained popularity.
Training these models requires ever more computational resources and most of
the existing models are trained on English text only. It is exceedingly
expensive to train these models in other languages. To alleviate this problem,
we introduce a method -- called WECHSEL -- to transfer English models to new
languages. We exchange the tokenizer of the English model with a tokenizer in
the target language and initialize token embeddings such that they are close to
semantically similar English tokens by utilizing multilingual static word
embeddings covering English and the target language. We use WECHSEL to transfer
GPT-2 and RoBERTa models to 4 other languages (French, German, Chinese and
Swahili). WECHSEL improves over a previously proposed method for cross-lingual
parameter transfer and outperforms models of comparable size trained from
scratch in the target language with up to 64x less training effort. Our method
makes training large language models for new languages more accessible and less
damaging to the environment. We make our code and models publicly available.
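The core of the method, as described in the abstract, is to initialize each token embedding of the new target-language tokenizer from semantically similar English tokens, using multilingual static word embeddings as a bridge. The snippet below is a minimal numpy sketch of that idea under simplifying assumptions, not the authors' released implementation: the function name, the top-k neighborhood, and the softmax temperature are illustrative, and the step of deriving aligned static vectors for subword tokens is omitted.

```python
import numpy as np

def init_target_embeddings(src_model_emb, src_static, tgt_static, k=10, temperature=0.1):
    """Initialize target-language token embeddings as similarity-weighted
    averages of pretrained English token embeddings (WECHSEL-style sketch)."""
    # Cosine similarities between target and English tokens in the shared static space
    src_norm = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_norm = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = tgt_norm @ src_norm.T  # shape: (target vocab size, English vocab size)

    tgt_emb = np.zeros((tgt_static.shape[0], src_model_emb.shape[1]))
    for i, row in enumerate(sims):
        top = np.argpartition(row, -k)[-k:]        # k most similar English tokens
        weights = np.exp(row[top] / temperature)   # sharpen towards the closest ones
        weights /= weights.sum()
        tgt_emb[i] = weights @ src_model_emb[top]  # weighted average of their embeddings
    return tgt_emb
```

All other parameters of the English model are kept as-is, so only the embedding matrix needs a fresh initialization before continued training on target-language text.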
Related papers
- Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation [19.624330093598996]
Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data.
By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer.
We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian.
arXiv Detail & Related papers (2023-10-05T11:45:29Z)
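Tik-to-Tok's mapping step is closely related to WECHSEL: target-tokenizer tokens are tied to semantically similar source-tokenizer tokens via a word translation dictionary. The sketch below is a heavily simplified, hypothetical version; the paper generalizes beyond exact dictionary hits, whereas this version only averages source embeddings for tokens with a direct dictionary match and falls back to the mean embedding otherwise.

```python
import numpy as np

def map_embeddings_via_dictionary(src_model_emb, src_vocab, tgt_vocab, translations):
    """Dictionary-based initialization sketch: reuse the source model's embeddings
    of a target token's translations; unmapped tokens get the mean embedding."""
    fallback = src_model_emb.mean(axis=0)
    tgt_emb = np.tile(fallback, (len(tgt_vocab), 1))
    for i, token in enumerate(tgt_vocab):
        # translations: dict mapping a target token to a list of source-language words
        hits = [src_vocab[w] for w in translations.get(token, []) if w in src_vocab]
        if hits:
            tgt_emb[i] = src_model_emb[hits].mean(axis=0)
    return tgt_emb
```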
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning [0.7612676127275795]
Most Transformer language models are pretrained on English text.
As model sizes grow, the performance gap between English and other languages increases even further.
We introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer.
arXiv Detail & Related papers (2023-01-23T18:56:12Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed method replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
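The recipe described in the entry above (a small language-specific vocabulary whose new embeddings are trained while the original parameters stay frozen) can be sketched in a few lines of PyTorch. The helper below is an illustrative assumption, not the paper's code: how the new embedding table is wired into the NMT encoder, decoder, and output layer is omitted.

```python
import torch
import torch.nn as nn

def add_language_specific_embeddings(model, new_vocab_size, d_model, lr=1e-4):
    """Freeze all original parameters and train only a new embedding table
    for the added language (illustrative sketch, not the paper's code)."""
    for param in model.parameters():
        param.requires_grad = False          # the original model is left untouched

    model.new_embeddings = nn.Embedding(new_vocab_size, d_model)
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Only the new embedding table is updated when fine-tuning on the new
    # language's parallel data, so performance on the initial languages is preserved.
    return torch.optim.Adam(trainable, lr=lr)
```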
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity.
We propose an algorithm VoCap to determine the desired vocabulary capacity of each language.
Since a larger vocabulary makes the output softmax more expensive, we also propose k-NN-based target sampling to accelerate it.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models [2.457872341625575]
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP.
We show that such models behave in markedly different ways depending on the unseen language.
arXiv Detail & Related papers (2020-10-24T10:15:03Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation [73.65237422910738]
We present an easy and efficient method to extend existing sentence embedding models to new languages.
This makes it possible to create multilingual versions of previously monolingual models.
arXiv Detail & Related papers (2020-04-21T08:20:25Z)
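The knowledge-distillation idea in the entry above is commonly realized as a teacher-student objective on parallel sentences: a frozen monolingual teacher embeds the English side, and a multilingual student is trained to map both the English sentence and its translation onto that target vector. The function below is a minimal sketch of such an objective; `teacher` and `student` are assumed to be callables returning sentence embeddings for a batch.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, english_batch, translated_batch):
    """Teacher-student sketch: pull the student's embeddings of an English
    sentence and of its translation towards the frozen teacher's embedding."""
    with torch.no_grad():
        target = teacher(english_batch)                  # monolingual teacher, no gradients
    loss_source = F.mse_loss(student(english_batch), target)
    loss_translation = F.mse_loss(student(translated_batch), target)
    return loss_source + loss_translation
```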
- Give your Text Representation Models some Love: the Case for Basque [24.76979832867631]
Word embeddings and pre-trained language models make it possible to build rich representations of text.
Many small companies and research groups tend to use models that have been pre-trained and made available by third parties.
This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora.
We show that a number of monolingual models trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks.
arXiv Detail & Related papers (2020-03-31T18:01:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.