Prompting a Pretrained Transformer Can Be a Universal Approximator
- URL: http://arxiv.org/abs/2402.14753v1
- Date: Thu, 22 Feb 2024 18:12:48 GMT
- Title: Prompting a Pretrained Transformer Can Be a Universal Approximator
- Authors: Aleksandar Petrov, Philip H.S. Torr, Adel Bibi
- Abstract summary: We show that much smaller pretrained models than previously thought can be universal approximators when prefixed.
We also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
- Score: 105.59562522323274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the widespread adoption of prompting, prompt tuning, and prefix-tuning
of transformer models, our theoretical understanding of these fine-tuning methods remains
limited. A key question is whether one can arbitrarily modify the behavior of a pretrained
model by prompting or prefix-tuning it; formally, whether prompting and prefix-tuning a
pretrained model can universally approximate sequence-to-sequence functions. This paper
answers in the affirmative and demonstrates that much smaller pretrained models than
previously thought can be universal approximators when prefixed. In fact, the attention
mechanism is uniquely suited for universal approximation with prefix-tuning: prefix-tuning
a single attention head is sufficient to approximate any continuous function. Moreover, any
sequence-to-sequence function can be approximated by prefixing a transformer with depth
linear in the sequence length. Beyond these density-type results, we also offer Jackson-type
bounds on the length of the prefix needed to approximate a function to a desired precision.
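
The mechanism at the heart of the result is easy to state concretely: prefix-tuning leaves the pretrained weights frozen and only prepends trainable vectors to the keys and values that an attention head attends over. The following is a minimal sketch of that setup in PyTorch, not the paper's construction; the module name `PrefixedAttentionHead`, the dimensions, and the initialization are assumptions made for illustration.

```python
# Minimal sketch (not the paper's construction): prefix-tuning a single
# attention head. The pretrained projections are frozen; only the prefix
# keys/values are trainable. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrefixedAttentionHead(nn.Module):
    def __init__(self, d_model: int, prefix_len: int):
        super().__init__()
        # Stand-ins for pretrained, frozen projection weights.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        for proj in (self.q_proj, self.k_proj, self.v_proj):
            proj.weight.requires_grad_(False)
        # The only trainable parameters: the prefix's keys and values.
        self.prefix_k = nn.Parameter(0.02 * torch.randn(prefix_len, d_model))
        self.prefix_v = nn.Parameter(0.02 * torch.randn(prefix_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Prepend the prefix to keys and values only; queries come from the
        # real input, so the output keeps the input's sequence length.
        batch = x.shape[0]
        k = torch.cat([self.prefix_k.expand(batch, -1, -1), k], dim=1)
        v = torch.cat([self.prefix_v.expand(batch, -1, -1), v], dim=1)
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v


head = PrefixedAttentionHead(d_model=16, prefix_len=8)
out = head(torch.randn(2, 5, 16))  # shape: (2, 5, 16)
```

In this sketch only `prefix_k` and `prefix_v` would receive gradients when the prefix is tuned to imitate a target function; everything standing in for the pretrained model stays fixed.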
Related papers
- Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers [0.0]
We study elementary edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer.
We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic.
arXiv Detail & Related papers (2024-10-17T17:39:46Z) - Transformers As Approximations of Solomonoff Induction [7.890110890837779]
Solomonoff Induction is an optimal-in-the-limit algorithm for sequence prediction.
Being an optimal form of computational sequence prediction, it seems plausible that it may be used as a model against which other methods of sequence prediction might be compared.
We put forth and explore the hypothesis that Transformer models approximate Solomonoff Induction better than any other extant sequence prediction method.
arXiv Detail & Related papers (2024-08-22T02:05:44Z) - Universality and Limitations of Prompt Tuning [65.8354898840308]
We take one of the first steps toward understanding the role of soft-prompt tuning for transformer-based architectures.
We analyze prompt tuning through the lens of universality and limitations, for finite-depth pretrained transformers on continuous-valued functions.
Our result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions (a minimal soft-prompt sketch follows this list).
arXiv Detail & Related papers (2023-05-30T06:47:07Z) - Sampled Transformer for Point Sets [80.66097006145999]
The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
arXiv Detail & Related papers (2023-02-28T06:38:05Z) - Inducer-tuning: Connecting Prefix-tuning and Adapter-tuning [53.72897232951918]
We suggest a new variant of prefix-tuning -- inducer-tuning -- which shares the exact mechanism with prefix-tuning while leveraging the residual form found in adapter-tuning.
We show that inducer-tuning can close the performance gap between prefix-tuning and fine-tuning.
arXiv Detail & Related papers (2022-10-26T04:39:42Z) - Alleviate Exposure Bias in Sequence Prediction with Recurrent Neural Networks [47.52214243454995]
A popular strategy to train recurrent neural networks (RNNs) is to take the ground truth as input at each time step.
We propose a fully differentiable training algorithm for RNNs to better capture long-term dependencies.
arXiv Detail & Related papers (2021-03-22T06:15:22Z) - Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the frozen pretrained transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z) - Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech
Recognition [66.47000813920617]
We propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition.
The proposed model can accurately predict the length of the target sequence and achieve a competitive performance.
The model even achieves a real-time factor of 0.0056, which outpaces all mainstream speech recognition models.
arXiv Detail & Related papers (2020-05-16T08:27:20Z)
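
For contrast with prefix-tuning, the entry on "Universality and Limitations of Prompt Tuning" above concerns soft-prompt tuning, where trainable embeddings are prepended only at the input and the pretrained transformer itself stays frozen. The sketch below is a minimal illustration under those assumptions; the `SoftPrompt` module and all names and dimensions are hypothetical, not taken from that paper.

```python
# Minimal sketch of soft-prompt tuning: trainable embeddings are prepended
# to the input sequence, and the pretrained transformer stays frozen.
# The module and all names/sizes are illustrative assumptions.
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    def __init__(self, prompt_len: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(0.02 * torch.randn(prompt_len, d_model))

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, d_model) token embeddings of the input.
        batch = embeddings.shape[0]
        return torch.cat([self.prompt.expand(batch, -1, -1), embeddings], dim=1)


# A frozen encoder standing in for a pretrained transformer.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():
    p.requires_grad_(False)

soft_prompt = SoftPrompt(prompt_len=8, d_model=16)
x = torch.randn(2, 5, 16)      # stand-in token embeddings
out = encoder(soft_prompt(x))  # shape: (2, 13, 16)
```

The design difference from the prefix sketch above is simply where the trainable parameters enter: at the input embeddings here, versus inside the attention computation there.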
This list is automatically generated from the titles and abstracts of the papers on this site.