Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers
- URL: http://arxiv.org/abs/2408.16241v1
- Date: Thu, 29 Aug 2024 03:50:24 GMT
- Title: Making the Most of your Model: Methods for Finetuning and Applying Pretrained Transformers
- Authors: Davis Yoshida
- Abstract summary: This thesis provides methods and analysis of models which make progress on this goal.
We introduce two new finetuning methods which add new capabilities to the models they are used on.
We provide theoretical and empirical insights on the divergence of model-likelihood and output quality.
- Score: 0.21756081703276003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This thesis provides methods and analysis of models which make progress on this goal. The techniques outlined are task agnostic, and should provide benefit when used with nearly any transformer LM. We introduce two new finetuning methods which add new capabilities to the models they are used on. The first adds a recurrence mechanism, which removes the fixed-window size constraint and improves the efficiency of a transformer decoder. The second allows masked language models (MLMs) to be used for initialization of both the encoder and decoder of a non-autoregressive sequence-to-sequence transformer, opening up generative applications of models which were previously only used for natural language understanding tasks. We also introduce two new techniques for improving the quality of predictions of any transformer decoder without additional finetuning. One, hidden state optimization, can be applied to any transformer decoder to improve the quality of predictions at inference time, especially for few-shot classification. The other, conditional beam search, allows practitioners to search for natural language generation (NLG) model outputs with high likelihood while conditioning on the event that the output is not degenerate (e.g., empty or repetitive). Finally, we provide theoretical and empirical insights on the divergence between model likelihood and output quality, which has been widely observed in prior work. These insights apply to any model which represents a distribution over text, including language models which are not transformers or even autoregressive. We argue that the NLP community has, to some extent, misunderstood the implications of these findings, and encourage a more nuanced point of view.
Related papers
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Beyond Self-learned Attention: Mitigating Attention Bias in Transformer-based Models Using Attention Guidance [9.486558126032639]
We introduce SyntaGuid, a novel approach to guide Transformer-based models towards critical source code tokens.
We show that SyntaGuid can improve overall performance by up to 3.25% and fix up to 28.3% of wrong predictions.
arXiv Detail & Related papers (2024-02-26T18:03:50Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- On Robustness of Finetuned Transformer-based NLP Models [11.063628128069736]
We characterize changes between pretrained and finetuned language model representations across layers using two metrics: CKA and STIR.
GPT-2 representations are more robust than BERT and T5 across multiple types of input perturbations.
This study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models.
arXiv Detail & Related papers (2023-05-23T18:25:18Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size [41.624797099537375]
We present a novel method for applying pretrained transformer language models.
We find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora.
arXiv Detail & Related papers (2020-08-16T23:19:30Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representation extraction or for fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
- Learning to Encode Position for Transformer with Continuous Dynamical Model [88.69870971415591]
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models.
We model the evolution of the encoded results along the position index with a continuous dynamical system.
arXiv Detail & Related papers (2020-03-13T00:41:41Z)