PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
- URL: http://arxiv.org/abs/2008.09144v2
- Date: Thu, 8 Oct 2020 18:37:54 GMT
- Title: PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data
- Authors: Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo
- Abstract summary: We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models have significantly better performance than the original T5 models.
- Score: 4.579262239784748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In natural language processing (NLP), there is a need for more resources in
Portuguese, since much of the data used in the state-of-the-art research is in
other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an
extensive collection of web pages in Portuguese, and evaluate its performance
against other Portuguese pretrained models and multilingual models on three
different tasks. We show that our Portuguese pretrained models have
significantly better performance than the original T5 models. Moreover, we
demonstrate the positive impact of using a Portuguese vocabulary. Our code and
models are available at https://github.com/unicamp-dl/PTT5.
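Checkpoints like these are typically consumed through the Hugging Face transformers library. The sketch below shows one way that could look; the model identifier unicamp-dl/ptt5-base-portuguese-vocab is an assumption inferred from the repository name, not something stated in the abstract.
```python
# Minimal sketch: loading a PTT5 checkpoint and its Portuguese sentencepiece
# vocabulary with Hugging Face transformers. The model identifier below is an
# assumption and may differ from the actual released checkpoint names.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "unicamp-dl/ptt5-base-portuguese-vocab"  # assumed checkpoint id

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Tokenize a Portuguese sentence; a Portuguese vocabulary typically yields a
# more compact tokenization of Portuguese text than the original English T5
# vocabulary, which is one motivation the paper gives for using it.
text = "O modelo foi pré-treinado no corpus BrWac."
print(tokenizer.tokenize(text))

# Run a short generation just to confirm the checkpoint loads and runs.
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```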
Related papers
- From Brazilian Portuguese to European Portuguese [2.048226951354646]
Brazilian Portuguese and European Portuguese are two varieties of the same language.
There is a significant disproportion in the availability of resources between the two variants.
This inequity can impact the quality of translation services accessible to European Portuguese speakers.
arXiv Detail & Related papers (2024-08-14T10:58:48Z)
- ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language [10.39816548971042]
This work introduces ptt5-v2, investigating the continued pretraining of T5 models for Portuguese.
Finetuning on three Portuguese downstream tasks yields SOTA results on the latter two.
Perhaps surprisingly, the impact of the different pretraining configurations remains subtle compared to the baseline.
arXiv Detail & Related papers (2024-06-16T05:17:56Z)
- PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese [1.2779732438508473]
We contribute a collection of datasets for an array of language processing tasks and a collection of fine-tuned neural language models on these downstream tasks.
To align with mainstream benchmarks in the literature, originally developed in English, the datasets were machine-translated from English with a state-of-the-art translation engine.
The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work.
arXiv Detail & Related papers (2024-04-08T09:22:41Z)
- Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
arXiv Detail & Related papers (2024-02-08T13:47:50Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Transformers and Transfer Learning for Improving Portuguese Semantic Role Labeling [2.9005223064604078]
For low resource languages, and in particular for Portuguese, currently available SRL models are hindered by scarce training data.
We explore a model architecture with only a pre-trained BERT-based model, a linear layer, softmax and Viterbi decoding.
arXiv Detail & Related papers (2021-01-04T19:56:01Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- Lite Training Strategies for Portuguese-English and English-Portuguese Translation [67.4894325619275]
We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as diaeresis, acute and grave accents.
Our results show that our models have a competitive performance to state-of-the-art models while being trained on modest hardware.
arXiv Detail & Related papers (2020-08-20T04:31:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.