Expanding Pretrained Models to Thousands More Languages via
Lexicon-based Adaptation
- URL: http://arxiv.org/abs/2203.09435v1
- Date: Thu, 17 Mar 2022 16:48:22 GMT
- Title: Expanding Pretrained Models to Thousands More Languages via
Lexicon-based Adaptation
- Authors: Xinyi Wang, Sebastian Ruder, Graham Neubig
- Abstract summary: Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
- Score: 133.7313847857935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of multilingual pretrained models is highly dependent on the
availability of monolingual or parallel text present in a target language.
Thus, the majority of the world's languages cannot benefit from recent progress
in NLP as they have no or limited textual data. To expand possibilities of
using NLP technology in these under-represented languages, we systematically
study strategies that relax the reliance on conventional language resources
through the use of bilingual lexicons, an alternative resource with much better
language coverage. We analyze different strategies to synthesize textual or
labeled data using lexicons, and how this data can be combined with monolingual
or parallel text when available. For 19 under-represented languages across 3
tasks, our methods lead to consistent improvements of up to 5 and 15 points
with and without extra monolingual text respectively. Overall, our study
highlights how NLP methods can be adapted to thousands more languages that are
under-served by current technology.
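To make the lexicon-based strategy concrete, below is a minimal sketch, not the authors' released pipeline: it synthesizes pseudo target-language training data by word-for-word substitution through a bilingual lexicon while carrying token-level labels over unchanged. The lexicon file format, function names, and toy entries are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): word-for-word data
# synthesis through a bilingual lexicon. File format, names, and the toy
# lexicon below are illustrative assumptions.
import random

def load_lexicon(path):
    """Read a tab-separated bilingual lexicon into {source: [translations]}."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip malformed lines
            lexicon.setdefault(parts[0].lower(), []).append(parts[1])
    return lexicon

def synthesize(tokens, labels, lexicon):
    """Translate a labeled sentence token by token.

    Token-level labels (e.g., NER tags) are copied over unchanged, which is
    what makes the output usable as synthetic target-language training data.
    """
    out_tokens, out_labels = [], []
    for tok, lab in zip(tokens, labels):
        candidates = lexicon.get(tok.lower())
        # Fall back to copying the source token (names, numbers, OOV words).
        out_tokens.append(random.choice(candidates) if candidates else tok)
        out_labels.append(lab)
    return out_tokens, out_labels

# Toy example with placeholder target-language forms.
toy_lexicon = {"lives": ["tgt_lives"], "in": ["tgt_in"]}
tokens = ["Maria", "lives", "in", "Lima"]
labels = ["B-PER", "O", "O", "B-LOC"]
print(synthesize(tokens, labels, toy_lexicon))
```

When extra monolingual or parallel text is available, synthetic data of this kind would typically be mixed with it during adaptation rather than used in isolation, in line with the combinations studied in the paper.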
Related papers
- Embedding structure matters: Comparing methods to adapt multilingual
vocabularies to new languages [20.17308477850864]
Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English.
We propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one.
arXiv Detail & Related papers (2023-09-09T04:27:18Z)
- Multilingual Text Representation [3.4447129363520337]
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way from the simple one-hot representation of words.
We discuss how the full potential of language democratization could be realized, reaching beyond the known limits.
arXiv Detail & Related papers (2023-09-02T14:21:22Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups (a minimal sketch of the transliteration step follows this list).
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model to the multilingual setting.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short of translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
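Referring back to the romanization-based adaptation entry above, the sketch below shows transliteration as a pre-processing step before a multilingual pretrained model. The cited work uses the UROMAN tool; here unidecode is only an easily installable stand-in, and the choice of xlm-roberta-base is an illustrative assumption.

```python
# Minimal sketch: romanize text before subword tokenization. unidecode is a
# stand-in for UROMAN; the model name below is an illustrative assumption.
from unidecode import unidecode
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "Привет, мир"         # example sentence in a non-Latin script
romanized = unidecode(text)  # best-effort Latin transliteration ("Privet, mir")

# Comparing the two segmentations isolates the transliteration step; for
# scripts that are rare or unseen in pretraining, romanization maps text onto
# subwords the pretrained vocabulary covers much better.
print(tokenizer.tokenize(text))
print(tokenizer.tokenize(romanized))
```

In the cited paper both the adaptation corpora and the downstream task data would be romanized consistently; this snippet only isolates the pre-processing step.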