Extending the Subwording Model of Multilingual Pretrained Models for New Languages
- URL: http://arxiv.org/abs/2211.15965v1
- Date: Tue, 29 Nov 2022 06:55:34 GMT
- Title: Extending the Subwording Model of Multilingual Pretrained Models for New Languages
- Authors: Kenji Imamura and Eiichiro Sumita
- Abstract summary: In this paper, we add new subwords to the SentencePiece tokenizer to apply a multilingual pretrained model to new languages.
In our experiments, we segmented Inuktitut sentences into subwords without changing the segmentation of already pretrained languages.
- Score: 31.702393348980735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual pretrained models are effective for machine translation and
cross-lingual processing because they contain multiple languages in one model.
However, they are pretrained after their tokenizers are fixed; therefore it is
difficult to change the vocabulary after pretraining. When we extend the
pretrained models to new languages, we must modify the tokenizers
simultaneously. In this paper, we add new subwords to the SentencePiece
tokenizer to apply a multilingual pretrained model to new languages (Inuktitut
in this paper). In our experiments, we segmented Inuktitut sentences into
subwords without changing the segmentation of already pretrained languages, and
applied the mBART-50 pretrained model to English-Inuktitut translation.
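As a rough illustration of the workflow described in the abstract, the sketch below learns candidate subwords from Inuktitut text with SentencePiece, keeps only the pieces missing from the mBART-50 vocabulary, and registers them with a Hugging Face tokenizer before resizing the model's embedding matrix. The corpus path, vocabulary size, and the `add_tokens` route are assumptions for illustration only; the paper itself inserts the new pieces into the SentencePiece model so that segmentation of the already pretrained languages stays unchanged.

```python
# Sketch only: extend an mBART-50 tokenizer with Inuktitut subwords.
# Paths, vocab size, and the add_tokens route are illustrative assumptions;
# the paper instead adds the pieces to the SentencePiece model itself.
import sentencepiece as spm
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# 1) Learn candidate subwords from monolingual Inuktitut text (hypothetical file).
spm.SentencePieceTrainer.train(
    input="inuktitut.txt",        # hypothetical corpus
    model_prefix="iu_subwords",
    vocab_size=8000,
    character_coverage=1.0,
)
iu_sp = spm.SentencePieceProcessor(model_file="iu_subwords.model")
iu_pieces = [iu_sp.id_to_piece(i) for i in range(iu_sp.get_piece_size())]

# 2) Keep only pieces the pretrained tokenizer does not already know.
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
existing = set(tokenizer.get_vocab())
new_pieces = [p for p in iu_pieces if p not in existing]

# 3) Register the new subwords and grow the embedding matrix to match.
tokenizer.add_tokens(new_pieces)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
model.resize_token_embeddings(len(tokenizer))
```

In this sketch the new embedding rows are randomly initialized by `resize_token_embeddings` and would be learned during fine-tuning on English-Inuktitut bitext.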
Related papers
- PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment [68.20851615263953]
Large language models demonstrate reasonable multilingual abilities, despite predominantly English-centric pretraining.
The spontaneous multilingual alignment in these models is shown to be weak, leading to unsatisfactory cross-lingual transfer and knowledge sharing.
We propose PreAlign, a framework that establishes multilingual alignment prior to language model pretraining.
arXiv Detail & Related papers (2024-07-23T06:59:53Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models [3.6878069324996616]
We introduce a method -- called WECHSEL -- to transfer English models to new languages.
We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages.
arXiv Detail & Related papers (2021-12-13T12:26:02Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade (a sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators [10.557167523009392]
We present Multi-Stage Prompting, a simple and lightweight approach for better adapting pre-trained language models to translation tasks.
To make pre-trained language models better translators, we divide the translation process via pre-trained language models into three separate stages.
During each stage, we independently apply different continuous prompts to allow the pre-trained language model to better adapt to translation tasks.
arXiv Detail & Related papers (2021-10-13T10:06:21Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
- Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation [73.65237422910738]
We present an easy and efficient method to extend existing sentence embedding models to new languages.
This makes it possible to create multilingual versions of previously monolingual models.
arXiv Detail & Related papers (2020-04-21T08:20:25Z)
- Testing pre-trained Transformer models for Lithuanian news clustering [0.0]
Non-English languages cannot leverage such opportunities with models pre-trained on English text.
We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering.
Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.
arXiv Detail & Related papers (2020-04-03T14:41:54Z)
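The Continual Learning entry above shares with the present paper the idea of growing a model without touching its pretrained parameters. The sketch below is a loose illustration of that idea rather than the cited paper's exact recipe: it freezes all pretrained weights and lets only the newly appended embedding rows receive gradient updates. The example subword and the learning rate are arbitrary placeholders.

```python
# Sketch only: freeze all pretrained parameters and train just the newly
# appended embedding rows, so performance on the original languages is
# preserved by construction. The added token and hyperparameters are
# illustrative placeholders, not the cited paper's settings.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

old_vocab_size = len(tokenizer)          # vocabulary size before extension
tokenizer.add_tokens(["ᐃᓄᒃᑎᑐᑦ"])         # hypothetical new-language subword(s)
model.resize_token_embeddings(len(tokenizer))

# Freeze everything, then re-enable the (shared) embedding matrix and zero
# out the gradient of the original rows so only the new rows are updated.
for param in model.parameters():
    param.requires_grad = False
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad = True
embeddings.weight.register_hook(
    lambda grad: torch.cat(
        [torch.zeros_like(grad[:old_vocab_size]), grad[old_vocab_size:]], dim=0
    )
)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```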
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.