Multilingual unsupervised sequence segmentation transfers to extremely
low-resource languages
- URL: http://arxiv.org/abs/2110.08415v1
- Date: Sat, 16 Oct 2021 00:08:28 GMT
- Title: Multilingual unsupervised sequence segmentation transfers to extremely
low-resource languages
- Authors: C.M. Downey, Shannon Drizin, Levon Haroutunian, Shivin Thukral
- Abstract summary: Unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model multilingually.
We show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that unsupervised sequence-segmentation performance can be
transferred to extremely low-resource languages by pre-training a Masked
Segmental Language Model (Downey et al., 2021) multilingually. Further, we show
that this transfer can be achieved by training over a collection of
low-resource languages that are typologically similar (but phylogenetically
unrelated) to the target language. In our experiments, we transfer from a
collection of 10 Indigenous American languages (AmericasNLP, Mager et al.,
2021) to K'iche', a Mayan language. We compare our model to a monolingual
baseline, and show that the multilingual pre-trained approach yields much more
consistent segmentation quality across target dataset sizes, including a
zero-shot performance of 20.6 F1, and exceeds the monolingual performance in
9/10 experimental settings. These results have promising implications for
low-resource NLP pipelines involving human-like linguistic units, such as the
sparse transcription framework proposed by Bird (2020).
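The figures reported above (e.g., the 20.6 zero-shot F1) are segmentation F1 scores computed by comparing predicted segment boundaries against gold boundaries. The following is a minimal sketch of how such a boundary F1 can be computed; the function names, the boundary-index convention, and the example strings are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch of boundary-F1 scoring for unsupervised segmentation.
# Helper names, the boundary-offset convention, and the example segments
# are assumptions for illustration; they are not taken from the paper.

def boundary_indices(segments):
    """Return the set of character offsets where segment boundaries fall,
    excluding the trivial boundary at the end of the sequence."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets

def boundary_f1(predicted, gold):
    """Compute precision, recall, and F1 over predicted vs. gold boundaries."""
    pred_b = boundary_indices(predicted)
    gold_b = boundary_indices(gold)
    tp = len(pred_b & gold_b)
    precision = tp / len(pred_b) if pred_b else 0.0
    recall = tp / len(gold_b) if gold_b else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: one word segmented into morphs two different ways.
print(boundary_f1(predicted=["x", "in", "war", "ik"],
                  gold=["x", "in", "warik"]))
```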
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining [92.3702056505905]
We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases (a sketch of the temperature-based baseline appears after this list).
arXiv Detail & Related papers (2023-04-18T17:45:50Z)
- Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM [8.858671209228536]
We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets.
We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.
arXiv Detail & Related papers (2023-03-03T13:23:42Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only on the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is then trained on all available corpora to transfer knowledge from high-resource to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach to building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
arXiv Detail & Related papers (2022-01-09T23:36:44Z)
- Adapting Monolingual Models: Data can be Scarce when Language Similarity is High [3.249853429482705]
We investigate the performance of zero-shot transfer learning with as little data as possible.
We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties.
With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance.
arXiv Detail & Related papers (2021-05-06T17:43:40Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 Indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
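Several entries above (UniMax in particular) concern how pre-training data is distributed across languages of very different corpus sizes. The sketch below shows the standard temperature-based sampling that UniMax is reported to outperform: each language is sampled with probability proportional to its corpus size raised to 1/T. This is a generic illustration of that baseline, not UniMax itself, and the corpus sizes and variable names are made-up assumptions.

```python
# Sketch of standard temperature-based language sampling (the baseline that
# UniMax is compared against). Corpus sizes below are hypothetical numbers.

def temperature_sampling_probs(corpus_sizes, temperature=5.0):
    """Return per-language sampling probabilities proportional to
    size ** (1 / temperature); temperature > 1 upweights smaller corpora."""
    weights = {lang: n ** (1.0 / temperature) for lang, n in corpus_sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

sizes = {"es": 1_000_000, "quechua": 50_000, "kiche": 5_000}  # hypothetical
print(temperature_sampling_probs(sizes, temperature=5.0))
```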
This list is automatically generated from the titles and abstracts of the papers on this site.