Universality and Limitations of Prompt Tuning
- URL: http://arxiv.org/abs/2305.18787v2
- Date: Thu, 16 Nov 2023 08:26:59 GMT
- Title: Universality and Limitations of Prompt Tuning
- Authors: Yihan Wang, Jatin Chauhan, Wei Wang, Cho-Jui Hsieh
- Abstract summary: We take one of the first steps toward understanding the role of soft-prompt tuning for transformer-based architectures.
We analyze prompt tuning through the lens of universality and limitations, using finite-depth pretrained transformers on continuous-valued functions.
Our result guarantees the existence of a strong transformer with a prompt that approximates any sequence-to-sequence function in the set of Lipschitz functions.
- Score: 65.8354898840308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the demonstrated empirical efficacy of prompt tuning for adapting a
pretrained language model to a new task, the theoretical understanding of the
difference between "tuning parameters prepended to the input" and "tuning the
model weights" remains limited. We thus take one of the first steps toward
understanding the role of soft-prompt tuning for transformer-based architectures. By
considering a general-purpose architecture, we analyze prompt tuning through the
lens of both universal approximation and the limitations of finite-depth,
fixed-weight pretrained transformers on continuous-valued functions. Our
universality result guarantees the existence of a strong transformer with a
prompt that approximates any sequence-to-sequence function in the set of Lipschitz
functions. We first establish the limitations of prompt tuning for limited-depth
transformers by constructing a set of datasets that cannot be memorized by a
prompt of any length for a given single encoder layer. We also provide a lower
bound on the required number of tunable prompt parameters and compare it
with the number of parameters required for a low-rank update (based on
LoRA) in a single-layer setting. We finally extend our analysis to multi-layer
settings by providing sufficient conditions under which the transformer can at
best learn datasets generated by invertible functions only. Our theoretical claims are
also corroborated by empirical results.
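
To make the two adaptation schemes contrasted in the abstract concrete, below is a minimal sketch (not taken from the paper) of soft-prompt tuning in PyTorch: the pretrained transformer weights are frozen and only a block of prompt vectors prepended to the input sequence is trained. The class name `SoftPromptWrapper`, the hyperparameter values, and the parameter-count comparison against a rank-r LoRA update are illustrative assumptions, not the paper's construction.

```python
# Sketch of soft-prompt tuning: freeze the pretrained encoder, learn only
# prompt_len * d_model soft-prompt entries prepended to the input embeddings.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, encoder: nn.Module, d_model: int, prompt_len: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # pretrained weights stay fixed
        # The only tunable parameters: a (prompt_len x d_model) soft prompt.
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) continuous-valued input embeddings
        prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.encoder(torch.cat([prompt, x], dim=1))

d_model, prompt_len, rank = 512, 20, 8
encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
model = SoftPromptWrapper(encoder, d_model, prompt_len)
out = model(torch.randn(4, 16, d_model))  # output shape: (4, 20 + 16, 512)

# Rough parameter counts of the kind the abstract compares (single-layer setting):
#   soft prompt of length p:            p * d_model tunable scalars
#   rank-r LoRA update of one d x d W:  2 * r * d_model tunable scalars
print("tunable prompt params:", prompt_len * d_model)  # 10240
print("rank-8 LoRA params   :", 2 * rank * d_model)    # 8192
```

The comment at the end only counts parameters; the paper's actual lower bound and its comparison with LoRA are stated formally in the paper itself.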