EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks
- URL: http://arxiv.org/abs/2110.08426v1
- Date: Sat, 16 Oct 2021 00:50:08 GMT
- Title: EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks
- Authors: Frederick Liu, Siamak Shakeri, Hongkun Yu, Jing Li
- Abstract summary: We study fine-tuning pre-trained encoder-decoder models such as T5.
Our experimental results show that EncT5, with less than half of the parameters of T5, performs similarly to T5 models on the GLUE benchmark.
- Score: 9.141586109808895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Encoder-decoder transformer architectures have become popular recently with
the advent of T5 models. Given its generality, this architecture is also more favorable
than architectures like BERT for pre-training on the language modeling task at large
scale, where models can take months to train. While it is able to generalize to more
tasks, it is not evident whether the proposed encoder-decoder architecture is the most
efficient choice for fine-tuning on classification and regression tasks given the
pre-trained model. In this work, we study fine-tuning pre-trained encoder-decoder
models such as T5. In particular, we propose EncT5 as a way to efficiently fine-tune
pre-trained encoder-decoder T5 models for classification and regression tasks by using
the encoder layers. Our experimental results show that EncT5, with less than half of
the parameters of T5, performs similarly to T5 models on the GLUE benchmark. We believe
our proposed approach can be easily applied to any pre-trained encoder-decoder model.
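As a rough illustration of the idea (not the paper's exact architecture), the sketch below fine-tunes only the T5 encoder with a small randomly initialized classification head on top. It uses Hugging Face's T5EncoderModel; the mean-pooling head, the `t5-base` checkpoint, and the toy batch are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

class EncoderOnlyClassifier(nn.Module):
    def __init__(self, model_name="t5-base", num_labels=2):
        super().__init__()
        # Pre-trained T5 encoder only (roughly half of the full T5 parameters).
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        # Randomly initialized classification head; the paper's exact head may differ.
        self.head = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Mean-pool over non-padding tokens before classifying.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(pooled)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = EncoderOnlyClassifier()
batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, torch.tensor([1, 0]))  # standard fine-tuning loss
loss.backward()
```

Because only the encoder plus a small head is trained, the fine-tuned model uses well under half of T5's parameters, which is consistent with the parameter-count claim in the abstract; the head design above is only one plausible choice.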
Related papers
- Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z) - SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection [49.43407207482008]
SpacTor is a new training procedure consisting of a hybrid objective that combines span corruption (SC) and replaced token detection (RTD).
In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training.
arXiv Detail & Related papers (2024-01-24T00:36:13Z) - UT5: Pretraining Non autoregressive T5 with unrolled denoising [9.656399724144192]
We studied unsupervised pretraining for non-autoregressive T5 models via unrolled denoising.
We showed state-of-the-art results on downstream generation tasks such as SQuAD question generation and XSum summarization.
arXiv Detail & Related papers (2023-11-14T21:28:10Z) - nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style
Models with Limited Resources [1.9813574408340644]
We present nanoT5, a framework for efficient pre-training and fine-tuning of T5 models.
nanoT5 allows a T5-Base model to be pre-trained on a single GPU in just 16 hours, without any loss in performance.
We make our contributions, including configurations, insights, and pre-trained models, available to the public.
arXiv Detail & Related papers (2023-09-05T16:35:41Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking
Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We fine-tune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z) - What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z) - LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z) - Scale Efficiently: Insights from Pre-training and Fine-tuning
Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z) - Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z) - Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text
Models [10.645591218689058]
We provide the first exploration of text-to-text transformers (T5) sentence embeddings.
We investigate three methods for extracting T5 sentence embeddings.
Our encoder-only models outperform BERT-based sentence embeddings on both transfer tasks and semantic textual similarity (a minimal extraction sketch follows this list).
arXiv Detail & Related papers (2021-08-19T18:58:02Z)
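As a small, hedged sketch of the Sentence-T5 idea referenced above, the snippet below extracts sentence embeddings by mean-pooling T5 encoder outputs, one of the extraction strategies that line of work investigates. The checkpoint name and example sentences are illustrative, not the paper's setup.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

def embed(sentences):
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    # Mean-pool over real (non-padding) tokens, then L2-normalize.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(emb, dim=-1)

a, b = embed(["A cat sits on the mat.", "A kitten rests on a rug."])
print(torch.dot(a, b).item())  # cosine similarity between the two sentences
```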
This list is automatically generated from the titles and abstracts of the papers on this site.