Related papers: Universality and Limitations of Prompt Tuning

URL: http://arxiv.org/abs/2305.18787v2
Date: Thu, 16 Nov 2023 08:26:59 GMT
Title: Universality and Limitations of Prompt Tuning
Authors: Yihan Wang, Jatin Chauhan, Wei Wang, Cho-Jui Hsieh
Abstract summary: We take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures. We analyze prompt tuning from the lens of universality and limitations with finite-depth pretrained transformers for continuous-valued functions. Our result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions.
Score: 65.8354898840308
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the demonstrated empirical efficacy of prompt tuning to adapt a pretrained language model for a new task, the theoretical underpinnings of the difference between "tuning parameters before the input" and "the tuning of model weights" are limited. We thus take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures. By considering a general purpose architecture, we analyze prompt tuning through the lens of both universal approximation and limitations with finite-depth fixed-weight pretrained transformers for continuous-valued functions. Our universality result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions. The limitations of prompt tuning for limited-depth transformers are first proved by constructing a set of datasets that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare the result with the number of parameters required for a low-rank update (based on LoRA) for a single-layer setting. We finally extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are also corroborated by empirical results.
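The contrast the abstract draws between tuning a prompt and tuning model weights can be made concrete with a simple parameter count. The sketch below is illustrative only: the function names, the embedding dimension, the prompt length, and the LoRA rank are assumptions chosen for the example, not values or bounds taken from the paper. It uses the standard counts: a soft prompt of m tokens in dimension d has m*d tunable values, and a rank-r LoRA update to one d_in x d_out weight matrix has r*(d_in + d_out).

```python
def soft_prompt_params(prompt_len: int, d_model: int) -> int:
    """Tunable values in a soft prompt: one d_model-vector per prompt token."""
    return prompt_len * d_model


def lora_params(rank: int, d_in: int, d_out: int) -> int:
    """Tunable values in a rank-`rank` update B @ A to a single weight matrix,
    with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


def prepend_prompt(prompt, tokens):
    """Prompt tuning only changes the input sequence: the (hypothetical)
    trainable prompt vectors are placed before the token embeddings,
    and the frozen model's weights are left untouched."""
    return list(prompt) + list(tokens)


if __name__ == "__main__":
    d = 768  # assumed embedding dimension, for illustration only
    print(soft_prompt_params(20, d))  # 20 * 768 = 15360
    print(lora_params(8, d, d))       # 8 * (768 + 768) = 12288
```

These counts say nothing about expressive power on their own; the paper's lower bound concerns how many such prompt parameters are required in the single-layer setting.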

Related papers

Adaptive Two Sided Laplace Transforms: A Learnable, Interpretable, and Scalable Replacement for Self-Attention [0.0]
We propose an innovative, learnable two-sided short-time Laplace transform (STLT) mechanism to supplant the traditional self attention in transformer-based LLMs. Our STLT introduces trainable parameters for each Laplace node, enabling end-to-end learning of decay rates. We further incorporate an efficient FFT-based computation of the relevance matrix and an adaptive node allocation mechanism to dynamically adjust the number of active Laplace nodes.
arXiv Detail & Related papers (2025-06-01T00:32:24Z)
Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency [13.566489504237868]
Our key contributions concern prompt tuning on single-head transformers with only a single self-attention layer. We prove that prompt tuning on such simplest possible transformers yields universal approximators for sequence-to-sequence Lipschitz functions.
arXiv Detail & Related papers (2024-11-25T16:12:17Z)
On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding [32.01426831450348]
Looped Transformers offer advantages in parameter efficiency and Turing completeness. We establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions.
arXiv Detail & Related papers (2024-10-02T10:31:17Z)
Towards Infinite-Long Prefix in Transformer [18.24137806007111]
We study the ability of prompting and context-based fine-tuning methods to match the performance of full-parameter fine-tuning. We implement an algorithm that only needs to introduce and fine-tune a few extra trainable parameters instead of an infinite-long prefix. Our method achieves superior or competitive performance compared to existing methods like full-parameter fine-tuning, P-Tuning V2, and LoRA.
arXiv Detail & Related papers (2024-06-20T06:56:35Z)
Prompting a Pretrained Transformer Can Be a Universal Approximator [105.59562522323274]
We show that much smaller pretrained models than previously thought can be universal approximators when prefixed. We also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
arXiv Detail & Related papers (2024-02-22T18:12:48Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning [32.84435258519842]
We propose Adaptive Prefix Tuning (APT) to adjust the prefix in terms of both fine-grained token level and coarse-grained layer level with a gate mechanism. Experiments on the SuperGLUE and NER datasets show the effectiveness of APT.
arXiv Detail & Related papers (2023-05-24T14:51:01Z)
Prompt Tuning for Generative Multimodal Pretrained Models [75.44457974275154]
We implement prompt tuning on the unified sequence-to-sequence pretrained model adaptive to both understanding and generation tasks. Experimental results demonstrate that the light-weight prompt tuning can achieve comparable performance with finetuning. In comparison with finetuned models, the prompt-tuned models demonstrate improved robustness against adversarial attacks.
arXiv Detail & Related papers (2022-08-04T08:56:38Z)
Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training. We conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.