mT5: A massively multilingual pre-trained text-to-text transformer
- URL: http://arxiv.org/abs/2010.11934v3
- Date: Thu, 11 Mar 2021 18:45:13 GMT
- Title: mT5: A massively multilingual pre-trained text-to-text transformer
- Authors: Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou,
Aditya Siddhant, Aditya Barua, Colin Raffel
- Abstract summary: "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
- Score: 60.0210636815514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified
text-to-text format and scale to attain state-of-the-art results on a wide
variety of English-language NLP tasks. In this paper, we introduce mT5, a
multilingual variant of T5 that was pre-trained on a new Common Crawl-based
dataset covering 101 languages. We detail the design and modified training of
mT5 and demonstrate its state-of-the-art performance on many multilingual
benchmarks. We also describe a simple technique to prevent "accidental
translation" in the zero-shot setting, where a generative model chooses to
(partially) translate its prediction into the wrong language. All of the code
and model checkpoints used in this work are publicly available.
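The abstract states that the code and model checkpoints are public but does not prescribe how to use them. Below is a minimal sketch of the unified text-to-text interface, assuming the checkpoints are loaded through the Hugging Face Transformers library under the commonly used model id google/mt5-small; the library, model id, and example prompt are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of mT5's text-to-text interface (assumed Hugging Face usage).
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Every task is phrased as "input text -> output text". The released
# checkpoints are pre-trained on span corruption only, so they need
# task-specific fine-tuning before the generated text is meaningful.
inputs = tokenizer(
    "question: What is mT5? context: mT5 is a multilingual variant of T5.",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```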
Related papers
- Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
arXiv Detail & Related papers (2024-02-08T13:47:50Z)
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, built on text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
- idT5: Indonesian Version of Multilingual T5 Transformer [0.0]
Indonesian is spoken by almost 200 million people and is the 10th most spoken language in the world.
In this study, the mT5 model was adapted to a single language, Indonesian, resulting in a smaller pre-trained T5 model specific to Indonesian.
A model fine-tuned from ours achieved 77.18% accuracy on sentiment analysis (SA), 8% higher than the mT5-based model, and obtained nearly the same scores as the mT5-based model on question generation (QG) and question answering (QA).
arXiv Detail & Related papers (2023-02-02T03:56:16Z)
- Evaluating Byte and Wordpiece Level Models for Massively Multilingual Semantic Parsing [3.431659287330068]
We compare a byte-level (ByT5) and a wordpiece-based (mT5) sequence-to-sequence model on the 51 languages of the MASSIVE multilingual semantic parsing dataset.
We are able to reduce the gap in exact match accuracy to only 5 points with respect to a model trained on gold data from all the languages.
arXiv Detail & Related papers (2022-12-14T13:48:32Z)
- T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5, a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation [6.021269454707625]
We introduce a new benchmark for Arabic language generation (ARGEN).
We pre-train three powerful Arabic-specific text-to-text Transformer-based models and evaluate them on the two benchmarks.
Our new models perform significantly better than mT5 and exceed MARBERT, the current state-of-the-art Arabic BERT-based model, on Arabic language understanding.
arXiv Detail & Related papers (2021-08-31T02:02:10Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
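The mT6 entry above names its cross-lingual pre-training tasks but gives no construction details. The following is a hedged sketch of what "translation pair span corruption" could look like for one parallel sentence pair: the pair is concatenated, random spans are replaced by T5-style sentinel tokens, and the removed spans form the target. The span sampling, sentinel format, and separator token are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of translation pair span corruption on a parallel sentence pair.
import random

def span_corrupt(tokens, mask_ratio=0.15, max_span=3, seed=0):
    """Mask roughly mask_ratio of tokens in contiguous spans (illustrative)."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_ratio))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        end = min(start + rng.randint(1, max_span), len(tokens))
        masked.update(range(start, end))

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            # One sentinel marks each removed span in both input and target.
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

# Corrupting the concatenation of a source sentence and its translation forces
# the model to recover spans using context from both languages.
pair = "The cat sleeps . </s> Le chat dort .".split()
source, target = span_corrupt(pair)
print(source)
print(target)
```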