Extending Multilingual BERT to Low-Resource Languages
- URL: http://arxiv.org/abs/2004.13640v1
- Date: Tue, 28 Apr 2020 16:36:41 GMT
- Title: Extending Multilingual BERT to Low-Resource Languages
- Authors: Zihan Wang, Karthikeyan K, Stephen Mayhew, Dan Roth
- Abstract summary: M-BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning.
We propose a simple but effective approach to extend M-BERT so that it can benefit any new language.
- Score: 71.0976635999159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual BERT (M-BERT) has been a huge success in both supervised and
zero-shot cross-lingual transfer learning. However, this success has focused
only on the top 104 languages in Wikipedia that it was trained on. In this
paper, we propose a simple but effective approach to extend M-BERT (E-MBERT) so
that it can benefit any new language, and show that our approach benefits
languages that are already in M-BERT as well. We perform an extensive set of
experiments with Named Entity Recognition (NER) on 27 languages, only 16 of
which are in M-BERT, and show an average increase of about 6% F1 on languages
that are already in M-BERT and a 23% F1 increase on new languages.
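The abstract does not spell out the extension procedure, so the sketch below is only one plausible reading of "extend M-BERT": add new-language tokens to the vocabulary, resize the embedding matrix, and continue masked-language-model pretraining on monolingual target-language text. It assumes the Hugging Face transformers and PyTorch APIs; the token list, corpus, and hyperparameters are illustrative placeholders, not the authors' setup.

```python
# A minimal sketch, assuming Hugging Face transformers + PyTorch, of one plausible
# way to extend M-BERT to a new language: grow the vocabulary, resize embeddings,
# and continue masked language modeling (MLM) on monolingual text.
import torch
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical frequent words of the new language missing from M-BERT's vocabulary.
new_tokens = ["umuntu", "ngumuntu", "ngabantu"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

# Tiny stand-in corpus; in practice this is raw monolingual text in the new language.
corpus = ["umuntu ngumuntu ngabantu", "more text in the target language"]
batch = tokenizer(corpus, padding=True, truncation=True, return_tensors="pt")

# Randomly mask 15% of tokens and build MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
mlm_batch = collator([{"input_ids": ids} for ids in batch["input_ids"]])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(input_ids=mlm_batch["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=mlm_batch["labels"]).loss
loss.backward()
optimizer.step()
print(f"MLM loss after one step: {loss.item():.3f}")
```

In a full setup, downstream NER fine-tuning would then start from this continued-pretraining checkpoint rather than from vanilla M-BERT.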
Related papers
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- gaBERT -- an Irish Language Model [7.834915319072005]
gaBERT is a monolingual BERT model for the Irish language.
We show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance.
arXiv Detail & Related papers (2021-07-27T16:38:53Z)
- Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models [6.166295570030645]
Masked sentences such as "Paris is the capital of [MASK]" are used as probes.
We translate the established benchmarks TREx and GoogleRE into 53 languages.
We find that using mBERT as a knowledge base yields varying performance across languages.
arXiv Detail & Related papers (2021-02-01T15:07:06Z)
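To make the cloze-style probing in the Multilingual LAMA entry above concrete, the snippet below queries mBERT with the "Paris is the capital of [MASK]" probe. It assumes the Hugging Face fill-mask pipeline; the model choice is illustrative, and the real benchmark uses many more templates and languages.

```python
# A minimal probing sketch, assuming the Hugging Face fill-mask pipeline:
# query mBERT with a cloze-style prompt and inspect its top predictions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
for prediction in fill_mask("Paris is the capital of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```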
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
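The anchor-based BWE entry above mentions aligning monolingual embeddings using only a few thousand translation pairs. As background, the sketch below shows the standard orthogonal-Procrustes alignment from a seed dictionary; it is not the paper's anchor-based initialization, and the vectors are random placeholders standing in for real word embeddings.

```python
# Background sketch: orthogonal Procrustes alignment of two embedding spaces
# from a small seed dictionary of translation pairs (placeholder vectors).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))  # source-language vectors of the seed pairs
Y = rng.normal(size=(5000, 300))  # target-language vectors of the same pairs

# W = argmin ||X W - Y||_F  subject to  W^T W = I  (closed form via SVD)
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt
mapped = X @ W  # source vectors projected into the target (bilingual) space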
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
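To illustrate the code-switching augmentation described in the CoSDA-ML entry above, here is a toy sketch of dictionary-based word substitution. The dictionaries, substitution ratio, and sampling scheme are invented placeholders and may differ from the paper's actual procedure.

```python
# Toy sketch of code-switching augmentation: randomly replace source-language
# words with translations drawn from bilingual dictionaries (placeholders).
import random

# Hypothetical source-word -> target-word dictionaries.
DICTIONARIES = {
    "es": {"movie": "película", "good": "buena"},
    "de": {"movie": "Film", "good": "gut"},
}

def code_switch(sentence: str, ratio: float = 0.5, seed: int = 0) -> str:
    """With probability `ratio`, swap each word for a translation from a random target language."""
    rng = random.Random(seed)
    words = []
    for word in sentence.split():
        options = [d[word] for d in DICTIONARIES.values() if word in d]
        if options and rng.random() < ratio:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)

print(code_switch("this movie is really good"))  # mixes in translated words at random
```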
- Are All Languages Created Equal in Multilingual BERT? [22.954688396858085]
Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks.
We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages.
arXiv Detail & Related papers (2020-05-18T21:15:39Z)
- An Empirical Study of Pre-trained Transformers for Arabic Information Extraction [25.10651348642055]
We pre-train a customized bilingual BERT, dubbed GigaBERT, specifically for Arabic NLP and English-to-Arabic zero-shot transfer learning.
We study GigaBERT's effectiveness on zero-shot transfer across four IE tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction.
Our best model significantly outperforms mBERT, XLM-RoBERTa, and AraBERT in both the supervised and zero-shot transfer settings.
arXiv Detail & Related papers (2020-04-30T00:01:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.