nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style Models with Limited Resources
- URL: http://arxiv.org/abs/2309.02373v2
- Date: Tue, 24 Oct 2023 14:53:50 GMT
- Title: nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style Models with Limited Resources
- Authors: Piotr Nawrot
- Abstract summary: We present nanoT5, a framework for efficient pre-training and fine-tuning of T5 models.
nanoT5 allows a T5-Base model to be pre-trained on a single GPU in just 16 hours, without any loss in performance.
We make our contributions, including configurations, insights, and pre-trained models, available to the public.
- Score: 1.9813574408340644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art language models like T5 have revolutionized the NLP
landscape, but their computational demands hinder a large portion of the
research community. To address this challenge, we present nanoT5, a
specially-optimized PyTorch framework for efficient pre-training and
fine-tuning of T5 models. Drawing on insights from optimizer differences and
prioritizing efficiency, nanoT5 allows a T5-Base model to be pre-trained on a
single GPU in just 16 hours, without any loss in performance. With the
introduction of this open-source framework, we hope to widen the accessibility
to language modelling research and cater to the community's demand for more
user-friendly T5 (Encoder-Decoder) implementations. We make our contributions,
including configurations, codebase, pre-training insights, and pre-trained
models, available to the public.
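For readers who want a concrete picture of what single-GPU T5 pre-training involves, the snippet below is a minimal sketch of one span-corruption training step written against the Hugging Face transformers API rather than nanoT5's actual codebase; the checkpoint name google/t5-v1_1-base, the single-span corruption helper, the AdamW optimizer, and all hyperparameters are illustrative assumptions, not the framework's defaults.

```python
# Minimal sketch of one T5-style span-corruption pre-training step (NOT nanoT5's code).
# Assumes `torch` and `transformers` are installed; all names and values are illustrative.
import random

import torch
from transformers import AutoTokenizer, T5Config, T5ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tokenizer plus a randomly initialised T5-Base-sized model (pre-training from scratch).
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
config = T5Config.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration(config).to(device)


def span_corrupt(text, span_len=3):
    """Crude single-span approximation of T5's span-corruption objective:
    drop one random span from the input and train the decoder to emit it
    behind a sentinel token."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    start = random.randrange(max(1, len(ids) - span_len))
    s0 = tokenizer.convert_tokens_to_ids("<extra_id_0>")
    s1 = tokenizer.convert_tokens_to_ids("<extra_id_1>")
    inputs = ids[:start] + [s0] + ids[start + span_len:] + [tokenizer.eos_token_id]
    labels = [s0] + ids[start:start + span_len] + [s1, tokenizer.eos_token_id]
    return inputs, labels


texts = [
    "T5 casts every NLP problem as text-to-text generation.",
    "Span corruption removes spans and reconstructs them from sentinel tokens.",
]
batch = [span_corrupt(t) for t in texts]

# Pad the batch; label positions set to -100 are ignored by the loss.
pad = tokenizer.pad_token_id
max_in = max(len(i) for i, _ in batch)
max_out = max(len(l) for _, l in batch)
input_ids = torch.tensor([i + [pad] * (max_in - len(i)) for i, _ in batch], device=device)
attention_mask = (input_ids != pad).long()
labels = torch.tensor([l + [-100] * (max_out - len(l)) for _, l in batch], device=device)

# One optimisation step; the optimizer choice and learning rate are placeholders,
# not the configuration studied in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"pre-training loss: {loss.item():.3f}")
```

In a real pre-training run, the data pipeline, optimizer and learning-rate schedule, and precision settings dominate wall-clock time; the configurations released with the paper cover those choices.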
Related papers
- UT5: Pretraining Non autoregressive T5 with unrolled denoising [9.656399724144192]
We study unsupervised pretraining of non-autoregressive T5 models via unrolled denoising.
We show state-of-the-art results on downstream generation tasks such as SQuAD question generation and XSum summarization.
arXiv Detail & Related papers (2023-11-14T21:28:10Z)
- Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using redesigned ELECTRA-style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activation and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: the KLEJ benchmark adapted to the text-to-text format, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter-efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks [9.141586109808895]
We study fine-tuning pre-trained encoder-decoder models such as T5.
Our experimental results show that EncT5, with less than half of the parameters of T5, performs similarly to T5 models on the GLUE benchmark.
arXiv Detail & Related papers (2021-10-16T00:50:08Z)
- Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, aside from model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)
- NT5?! Training T5 to Perform Numerical Reasoning [0.8827543048499855]
Numerical reasoning over text (NRoT) presents unique challenges that are not well addressed by existing pre-training objectives.
We show that the T5 multitasking framework can be trained on multiple numerical reasoning datasets of increasing difficulty without manually engineering partitioned functionality.
arXiv Detail & Related papers (2021-04-15T08:34:44Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive, improved models with reduced communication and computational costs (a minimal top-1 routing sketch follows this list).
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
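The Switch Transformers entry above mentions a simplified MoE routing algorithm; the sketch below illustrates top-1 ("switch") routing in plain PyTorch. The layer sizes, the number of experts, and the omission of expert-capacity limits and the load-balancing loss are simplifying assumptions for illustration, not the paper's implementation.

```python
# Illustrative top-1 (switch) mixture-of-experts feed-forward layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFFN(nn.Module):
    """Routes each token to exactly one expert feed-forward network (top-1 routing)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        logits = self.router(x)                 # (batch, seq, num_experts)
        probs = F.softmax(logits, dim=-1)
        gate, expert_idx = probs.max(dim=-1)    # top-1 gate value and expert id per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e              # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])
        return out * gate.unsqueeze(-1)         # scale outputs by the router probability


# Usage on a dummy batch of token representations.
layer = SwitchFFN()
tokens = torch.randn(2, 8, 512)
print(layer(tokens).shape)  # torch.Size([2, 8, 512])
```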