ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
- URL: http://arxiv.org/abs/2406.10806v1
- Date: Sun, 16 Jun 2024 05:17:56 GMT
- Title: ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
- Authors: Marcos Piau, Roberto Lotufo, Rodrigo Nogueira
- Abstract summary: This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese.
Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two.
Perhaps surprisingly, the impact of different pretraining settings remains subtle compared to our baseline.
- Score: 10.39816548971042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and the finetuned MonoT5 rerankers on HuggingFace at https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0 and https://huggingface.co/collections/unicamp-dl/monoptt5-66653981877df3ea727f720d.
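Since the pretrained checkpoints and the finetuned MonoT5 rerankers are released on HuggingFace, they should be usable through the standard transformers seq2seq API. The snippet below is a minimal sketch of loading one of the pretrained checkpoints; the repository ID unicamp-dl/ptt5-v2-base and the availability of T5 span-corruption sentinel tokens are assumptions inferred from the collection URL above, so the exact model names and sizes should be confirmed on the collection page.
```python
# Minimal sketch: load a ptt5-v2 checkpoint with HuggingFace transformers.
# The repository ID below is an assumption based on the collection URL;
# check the linked collection for the exact model names and sizes.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "unicamp-dl/ptt5-v2-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Span-corruption style prompt in Portuguese; assumes the tokenizer keeps
# T5's <extra_id_*> sentinel tokens.
text = "O modelo foi treinado em um grande corpus de textos em <extra_id_0>."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
The same pattern should carry over to the larger checkpoints in the collection; only the repository ID changes.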
Related papers
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification with text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z) - Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models that we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z) - T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z) - Sequence to sequence pretraining for a less-resourced Slovenian language [0.0]
We train two differently sized T5-type sequence-to-sequence models for the morphologically rich Slovene language with far fewer resources and analyze their behavior.
On classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model but are worth considering for generative tasks.
arXiv Detail & Related papers (2022-07-28T10:08:50Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results in in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data gives GLUE performance close to that obtained by pre-training on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data [4.579262239784748]
We pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese.
We show that our Portuguese pretrained models have significantly better performance than the original T5 models.
arXiv Detail & Related papers (2020-08-20T18:10:13Z) - Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation [50.0258495437314]
We investigate the benefits and drawbacks of freezing parameters, and adding new ones, when fine-tuning a pre-trained model on Machine Translation (MT).
For BART we get the best performance by freezing most of the model parameters, and adding extra positional embeddings.
For mBART we match or outperform the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen.
arXiv Detail & Related papers (2020-04-30T16:09:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.