WikiBERT models: deep transfer learning for many languages
- URL: http://arxiv.org/abs/2006.01538v1
- Date: Tue, 2 Jun 2020 11:57:53 GMT
- Title: WikiBERT models: deep transfer learning for many languages
- Authors: Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, Filip Ginter
- Abstract summary: We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify parser on Universal Dependencies data.
- Score: 1.3455090151301572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural language models such as BERT have enabled substantial recent
advances in many natural language processing tasks. Due to the effort and
computational cost involved in their pre-training, language-specific models are
typically introduced only for a small number of high-resource languages such as
English. While multilingual models covering large numbers of languages are
available, recent work suggests monolingual training can produce better models,
and our understanding of the tradeoffs between mono- and multilingual training
is incomplete. In this paper, we introduce a simple, fully automated pipeline
for creating language-specific BERT models from Wikipedia data and introduce 42
new such models, most for languages up to now lacking dedicated deep neural
language models. We assess the merits of these models using the
state-of-the-art UDify parser on Universal Dependencies data, contrasting
performance with results using the multilingual BERT model. We find that UDify
using WikiBERT models outperforms the parser using mBERT on average, with the
language-specific models showing substantially improved performance for some
languages, yet limited improvement or a decrease in performance for others. We
also present preliminary results as first steps toward an understanding of the
conditions under which language-specific models are most beneficial. All of the
methods and models introduced in this work are available under open licenses
from https://github.com/turkunlp/wikibert.
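The abstract's central comparison (language-specific WikiBERT models vs. multilingual mBERT, averaged over languages, with gains for some languages and decreases for others) can be illustrated with a minimal sketch. The scores below are invented for illustration only and are not the paper's actual results:

```python
# Toy per-language parser scores (made-up numbers, NOT from the paper):
# language code -> (language-specific model score, multilingual model score)
scores = {
    "fi": (89.1, 86.9),   # substantial gain for the language-specific model
    "et": (87.0, 85.5),
    "en": (90.2, 90.4),   # slight decrease, as the abstract notes can happen
    "tr": (71.3, 69.0),
}

def average_delta(scores):
    """Mean (language-specific minus multilingual) score difference."""
    deltas = [specific - multi for specific, multi in scores.values()]
    return sum(deltas) / len(deltas)

def languages_improved(scores):
    """Languages where the language-specific model scores higher."""
    return [lang for lang, (specific, multi) in scores.items() if specific > multi]

print(round(average_delta(scores), 2))   # positive on average
print(languages_improved(scores))        # not every language improves
```

This mirrors the paper's finding in shape only: a positive average difference can coexist with individual languages where the multilingual model is better.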
Related papers
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z) - Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
It consists in replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z) - ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT model for the Persian language (ParsBERT).
It shows its state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including both existing and newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z) - Give your Text Representation Models some Love: the Case for Basque [24.76979832867631]
Word embeddings and pre-trained language models make it possible to build rich representations of text.
Many small companies and research groups tend to use models that have been pre-trained and made available by third parties.
This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora.
We show that a number of monolingual models trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks.
arXiv Detail & Related papers (2020-03-31T18:01:56Z) - What the [MASK]? Making Sense of Language-Specific BERT Models [39.54532211263058]
This paper presents the current state of the art in language-specific BERT models.
Our aim is to provide an overview of the commonalities and differences between language-specific BERT models and mBERT.
arXiv Detail & Related papers (2020-03-05T20:42:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.