Related papers: Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning

URL: http://arxiv.org/abs/2106.09226v1
Date: Thu, 17 Jun 2021 03:31:47 GMT
Title: Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning
Authors: Colin Wei, Sang Michael Xie, Tengyu Ma
Abstract summary: We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM.
Score: 66.44344616836158
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory. Experiments on synthetically generated data from HMMs back our theoretical findings.
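As a rough illustration of what "a function of the posterior distribution over the latent variables" can mean, the sketch below computes the posterior over the final hidden state of a toy HMM with the forward algorithm, then applies a linear head to it. This is not the paper's construction: the paper's head reads the frozen pretrained model's outputs, while this sketch reads the posterior directly. The function names (`forward_posterior`, `linear_head`) and every probability and weight are assumptions made up for the example.

```python
# Toy 2-state HMM. All numbers are hypothetical, chosen only for illustration.

def forward_posterior(init, trans, emit, observations):
    """Return P(h_T | x_1..x_T), the posterior over the last hidden state.

    init[i]     : P(h_1 = i)
    trans[i][j] : P(h_{t+1} = j | h_t = i)
    emit[i][x]  : P(x_t = x | h_t = i)
    """
    n = len(init)
    # alpha[i] is proportional to P(h_t = i, x_1..x_t); renormalized each step.
    alpha = [init[i] * emit[i][observations[0]] for i in range(n)]
    total = sum(alpha)
    alpha = [a / total for a in alpha]
    for x in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][x]
            for j in range(n)
        ]
        total = sum(alpha)
        alpha = [a / total for a in alpha]
    return alpha


def linear_head(posterior, weights, bias=0.0):
    """A linear classifier on the posterior: label 1 if the score is positive."""
    score = sum(w * p for w, p in zip(weights, posterior)) + bias
    return 1 if score > 0 else 0


init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]   # state 0 mostly emits token 0
observations = [0, 0, 0]

posterior = forward_posterior(init, trans, emit, observations)
label = linear_head(posterior, weights=[1.0, -1.0])  # hypothetical head weights
```

Here the downstream label depends only on which hidden state is more probable, so a linear function of the posterior suffices in this toy case.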

Related papers

What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
Investigating the Impact of Model Complexity in Large Language Models [3.7919508292745676]
Large Language Models (LLMs) based on the pre-trained fine-tuning paradigm have become pivotal in solving natural language processing tasks. In this paper, we focus on autoregressive LLMs and propose to employ Hidden Markov Models (HMMs) to model them.
arXiv Detail & Related papers (2024-10-01T13:53:44Z)
Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting LLMs to downstream tasks. We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
Can recurrent neural networks learn process model structure? [0.2580765958706854]
We introduce an evaluation framework that combines variant-based resampling and custom metrics for fitness, precision and generalization. We confirm that LSTMs can struggle to learn process model structure, even with simplistic process data. We also found that decreasing the amount of information seen by the LSTM during training causes a sharp drop in generalization and precision scores.
arXiv Detail & Related papers (2022-12-13T08:40:01Z)
From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Model to Pre-trained Machine Reader [130.45769668885487]
Pre-trained Machine Reader (PMR) is a novel method for retrofitting masked language models (MLMs) to pre-trained machine reading comprehension (MRC) models without acquiring labeled data. To build the proposed PMR, we constructed a large volume of general-purpose and high-quality MRC-style training data. PMR has the potential to serve as a unified model for tackling various extraction and classification tasks in the MRC formulation.
arXiv Detail & Related papers (2022-12-09T10:21:56Z)
Robust Classification using Hidden Markov Models and Mixtures of Normalizing Flows [25.543231171094384]
We use a generative model that combines the state transitions of a hidden Markov model (HMM) and the neural network based probability distributions for the hidden states of the HMM. We verify the improved robustness of NMM-HMM classifiers in an application to speech recognition.
arXiv Detail & Related papers (2021-02-15T00:40:30Z)
Scaling Hidden Markov Language Models [118.55908381553056]
This work revisits the challenge of scaling HMMs to language modeling datasets. We propose methods for scaling HMMs to massive state spaces while maintaining efficient exact inference, a compact parameterization, and effective regularization.
arXiv Detail & Related papers (2020-11-09T18:51:55Z)
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study the minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT). From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation. We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
