ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
- URL: http://arxiv.org/abs/2406.10806v2
- Date: Mon, 18 Nov 2024 02:19:02 GMT
- Title: ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
- Authors: Marcos Piau, Roberto Lotufo, Rodrigo Nogueira
- Abstract summary: This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese.
Finetuning on three Portuguese downstream tasks yields SOTA results on two of them (assin2 RTE and TweetSentBR).
Perhaps surprisingly, the impact of different pretraining settings remains subtle compared to the baseline.
- Score: 10.39816548971042
- License:
- Abstract: Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including pretraining data quality, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and their MonoT5-based finetuned $\texttt{MonoPTT5}$ rerankers on HuggingFace in their respective collections at \url{https://huggingface.co/unicamp-dl}.
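As a rough illustration of how the released checkpoints might be used, the sketch below loads a ptt5-v2 model from the unicamp-dl HuggingFace collection with the `transformers` library and runs a single text-to-text generation call. The checkpoint id and the prompt are assumptions for illustration only; the exact model ids and task templates should be taken from the collection at https://huggingface.co/unicamp-dl and the paper.

```python
# Minimal sketch (not the authors' code): loading a ptt5-v2 checkpoint with
# the HuggingFace transformers library. The model id below is an assumption
# based on the collection URL; verify the exact ids on the hub.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "unicamp-dl/ptt5-v2-base"  # assumed id; check the unicamp-dl collection

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# T5 models are used in a text-to-text fashion: downstream tasks such as
# assin2 RTE or TweetSentBR are cast as generating a short label string.
# This prompt is purely illustrative, not the paper's finetuning template.
inputs = tokenizer(
    "Premissa: O filme foi ótimo. Hipótese: O filme foi bom.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same collection hosts the MonoT5-based $\texttt{MonoPTT5}$ rerankers; those are finetuned for passage reranking rather than the classification tasks sketched above.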
Related papers
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification using text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z) - Investigating Pre-trained Language Models on Cross-Domain Datasets, a
Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z) - T5lephone: Bridging Speech and Text Self-supervised Models for Spoken
Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - A Survey of Recent Abstract Summarization Techniques [0.0]
We investigate the impact of pre-trained models on several Wikipedia datasets in English and Indonesian.
The most significant factors that influence ROUGE performance are coverage, density, and compression.
T5-Large, Pegasus-XSum, and ProphetNet-CNNDM provide the best summarization results.
arXiv Detail & Related papers (2021-04-15T20:01:34Z) - Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - A Tailored Pre-Training Model for Task-Oriented Dialog Generation [60.05269529832447]
We propose a Pre-trained Role Alternating Language model (PRAL) for task-oriented conversational systems.
We introduce a task-oriented dialog pretraining dataset by cleaning 13 existing data sets.
The results show that PRAL performs better or on par with state-of-the-art methods.
arXiv Detail & Related papers (2020-04-24T09:25:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.