Are All Languages Created Equal in Multilingual BERT?
- URL: http://arxiv.org/abs/2005.09093v2
- Date: Thu, 1 Oct 2020 02:46:15 GMT
- Title: Are All Languages Created Equal in Multilingual BERT?
- Authors: Shijie Wu, Mark Dredze
- Abstract summary: Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks.
We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly
good cross-lingual performance on several NLP tasks, even without explicit
cross-lingual signals. However, these evaluations have focused on cross-lingual
transfer with high-resource languages, covering only a third of the languages
covered by mBERT. We explore how mBERT performs on a much wider set of
languages, focusing on the quality of representation for low-resource
languages, measured by within-language performance. We consider three tasks:
Named Entity Recognition (99 languages), Part-of-speech Tagging, and Dependency
Parsing (54 languages each). mBERT performs comparably to or better than
baselines on high-resource languages but much worse on low-resource languages.
Furthermore, monolingual BERT models for these languages do even worse. Pairing
a low-resource language with similar languages narrows the performance gap
between monolingual BERT and mBERT. We find that better models for low-resource
languages require more efficient pretraining techniques or more data.
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
We pre-train over 10,000 monolingual and multilingual language models for over 250 languages.
We find that in moderation, adding multilingual data improves low-resource language modeling performance.
As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages.
arXiv Detail & Related papers (2023-11-15T18:47:42Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- To What Degree Can Language Borders Be Blurred In BERT-based Multilingual Spoken Language Understanding?
We show that although a BERT-based multilingual Spoken Language Understanding (SLU) model works substantially well even on distant language groups, there is still a gap to the ideal multilingual performance.
We propose a novel BERT-based adversarial model architecture to learn language-shared and language-specific representations for multilingual SLU.
arXiv Detail & Related papers (2020-11-10T09:59:24Z)
- Towards Fully Bilingual Deep Language Modeling
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance in either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z)
- Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
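As a rough illustration of the embedding-manipulation idea described above, the following is a hedged sketch, not the authors' actual method: it assumes per-language mean embedding vectors are available and shifts token embeddings from the source-language region of the space toward the target-language region. All names and the interpolation knob `alpha` are placeholders.

```python
import numpy as np

def shift_language(token_embs, src_mean, tgt_mean, alpha=1.0):
    """Shift token embeddings toward another language's region.

    token_embs: (n_tokens, dim) array of token embeddings
    src_mean, tgt_mean: (dim,) mean embedding vector per language
    alpha: strength of the shift (hypothetical knob, not from the paper)
    """
    # Subtract the source-language mean and add the target-language mean;
    # broadcasting applies the (dim,) shift to every token row.
    return token_embs + alpha * (tgt_mean - src_mean)

# Toy example with made-up 4-dimensional embeddings.
embs = np.ones((2, 4))
en_mean = np.zeros(4)
fr_mean = np.full(4, 2.0)
shifted = shift_language(embs, en_mean, fr_mean)
```

With `alpha=0.0` the embeddings are unchanged; larger values move them further toward the target-language mean.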
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
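Code-switching augmentation of this general flavor can be sketched as lexicon-based token substitution. This is an illustrative assumption about what such augmentation might look like, not the CoSDA-ML implementation: the toy lexicons, the `swap_prob` rate, and the function name are all placeholders.

```python
import random

def code_switch(tokens, lexicons, swap_prob=0.3, rng=None):
    """Randomly replace tokens with translations drawn from
    bilingual lexicons (a code-switching augmentation sketch)."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        # Collect candidate translations across all target languages.
        candidates = [lex[tok] for lex in lexicons.values() if tok in lex]
        if candidates and rng.random() < swap_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)  # keep the original token
    return out

# Toy word-level lexicons (illustrative only, not real training data).
lexicons = {
    "de": {"good": "gut", "movie": "Film"},
    "es": {"good": "bueno", "movie": "película"},
}
augmented = code_switch("a good movie".split(), lexicons)
```

Each augmented sentence mixes words from several languages, which is the kind of signal that can push a multilingual encoder toward language-agnostic representations without requiring parallel sentences.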
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% while using less than one-fifth of the training parameters of other word-embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.