IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
- URL: http://arxiv.org/abs/2104.07483v2
- Date: Tue, 27 Apr 2021 09:07:50 GMT
- Title: IndT5: A Text-to-Text Transformer for 10 Indigenous Languages
- Authors: El Moatez Billah Nagoudi, Wei-Rui Chen, Muhammad Abdul-Mageed and
Hasan Cavusoglu
- Abstract summary: We introduce IndT5, the first Transformer language model for Indigenous languages.
We build IndCorpus, a new dataset for ten Indigenous languages and Spanish.
We present the application of IndT5 to machine translation by investigating different approaches to translate between Spanish and the Indigenous languages.
- Score: 7.952582509792971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer language models have become fundamental components of
natural language processing pipelines. Although several Transformer models have
been introduced to serve many languages, there is a shortage of models
pre-trained for low-resource and Indigenous languages. In this work, we
introduce IndT5, the first Transformer language model for Indigenous languages.
To train IndT5, we build IndCorpus, a new dataset for ten Indigenous languages
and Spanish. We also present the application of IndT5 to machine translation by
investigating different approaches to translate between Spanish and the
Indigenous languages as part of our contribution to the AmericasNLP 2021 Shared
Task on Open Machine Translation. IndT5 and IndCorpus are publicly available
for research.
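As a rough usage sketch, a T5-style checkpoint like IndT5 could be loaded through the Hugging Face transformers API; the model identifier and task prefix below are illustrative assumptions, not taken from the paper.

    # Minimal inference sketch for a T5-style translation model.
    # "UBC-NLP/IndT5" and the "translate Spanish to ..." prefix are
    # assumptions for illustration, not confirmed by the paper.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    MODEL_ID = "UBC-NLP/IndT5"  # hypothetical checkpoint identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    # Encode a Spanish source sentence and generate the target with beam search.
    inputs = tokenizer("translate Spanish to Quechua: El sol sale por el este.",
                       return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=4, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))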
Related papers
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, built on text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions, the product of the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
- idT5: Indonesian Version of Multilingual T5 Transformer [0.0]
Indonesian is spoken by almost 200 million people and is the 10th most spoken language in the world.
In this study, the mT5 model was adapted to a single language, Indonesian, resulting in a smaller pre-trained T5 model specific to Indonesian.
A model fine-tuned from ours achieved 77.18% accuracy on sentiment analysis (SA), 8% higher than the mT5-based model, and obtained nearly the same scores as the mT5-based model on question generation (QG) and question answering (QA).
arXiv Detail & Related papers (2023-02-02T03:56:16Z)
- AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation [6.021269454707625]
We introduce a new benchmark for Arabic language generation (ARGEN).
We pre-train three powerful Arabic-specific text-to-text Transformer-based models and evaluate them on the two benchmarks.
Our new models perform significantly better than mT5 and exceed MARBERT, the current state-of-the-art Arabic BERT-based model, on Arabic language understanding.
arXiv Detail & Related papers (2021-08-31T02:02:10Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
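As a rough illustration of translation pair span corruption, the source and target sentences are concatenated, random spans are replaced with sentinel tokens, and the model reconstructs the masked spans, so one language can help recover spans masked in the other. The sketch below is a simplified version over whitespace tokens, not the mT6 implementation.

    # Simplified sketch of T5-style span corruption applied to a
    # concatenated translation pair (mT6-style); span sampling and
    # tokenization are deliberately simplified.
    import random

    def span_corrupt(tokens, mask_ratio=0.15, mean_span=3, seed=0):
        """Replace random token spans with sentinels.

        Returns (corrupted_input, target): the input keeps unmasked tokens
        plus sentinels; the target lists each sentinel with the span it hides.
        """
        rng = random.Random(seed)
        budget = max(1, int(len(tokens) * mask_ratio))  # tokens left to mask
        corrupted, target = [], []
        i, sentinel = 0, 0
        while i < len(tokens):
            if budget > 0 and rng.random() < mask_ratio:
                span = min(mean_span, len(tokens) - i, budget)
                tok = f"<extra_id_{sentinel}>"
                corrupted.append(tok)
                target.extend([tok] + tokens[i:i + span])
                i += span
                budget -= span
                sentinel += 1
            else:
                corrupted.append(tokens[i])
                i += 1
        return corrupted, target

    # Corrupt the concatenated source/target pair.
    pair = "The cat sleeps . </s> Le chat dort .".split()
    inp, tgt = span_corrupt(pair)
    print(" ".join(inp))
    print(" ".join(tgt))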
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
- XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
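The general recipe can be approximated with the generic encoder-decoder warm-start utilities in Hugging Face transformers. This is a sketch of the idea, not the authors' code; XLM-T itself initializes a standard NMT Transformer from XLM-R weights.

    # Warm-start a seq2seq model from a pretrained cross-lingual encoder
    # (XLM-R here), then fine-tune on multilingual parallel data. This only
    # approximates the XLM-T setup.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "xlm-roberta-base", "xlm-roberta-base"  # decoder gains cross-attention
    )

    # Generation settings required before fine-tuning and decoding.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id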
arXiv Detail & Related papers (2020-12-31T11:16:51Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
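The unified format casts every task as a (source text, target text) pair; the prefixes below mirror the published T5 conventions, while mT5 itself is pre-trained on span corruption without task prefixes.

    # Each NLP task becomes plain text in, plain text out, so one seq2seq
    # model and one loss cover translation, summarization, and
    # classification alike.
    examples = [
        ("translate English to German: That is good.", "Das ist gut."),
        ("summarize: heavy storms hit the coast overnight ...", "storms strike coast"),
        ("cola sentence: The course is jumping well.", "unacceptable"),
    ]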
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- Pre-training Polish Transformer-based Language Models at Scale [1.0312968200748118]
We present two language models for Polish based on the popular BERT architecture.
We describe our methodology for collecting the data, preparing the corpus, and pre-training the model.
We then evaluate our models on thirteen Polish linguistic tasks and demonstrate improvements on eleven of them.
arXiv Detail & Related papers (2020-06-07T18:48:58Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)