Memory Limitations of Prompt Tuning in Transformers
- URL: http://arxiv.org/abs/2509.00421v1
- Date: Sat, 30 Aug 2025 09:08:00 GMT
- Title: Memory Limitations of Prompt Tuning in Transformers
- Authors: Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan
- Abstract summary: We show that the amount of information memorized by a transformer cannot scale faster than linearly with the prompt length. We also present the first formal proof of a phenomenon empirically observed in large language models: performance degradation with extended contexts. This finding offers a fundamental understanding of the intrinsic limitations of transformer architectures.
- Score: 45.158621811869466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the empirical success of prompt tuning in adapting pretrained language models to new tasks, theoretical analyses of its capabilities remain limited. Existing theoretical work primarily addresses universal approximation properties, demonstrating results comparable to standard weight tuning. In this paper, we explore a different aspect of the theory of transformers: the memorization capability of prompt tuning. We provide two principal theoretical contributions. First, we prove that the amount of information memorized by a transformer cannot scale faster than linearly with the prompt length. Second, and more importantly, we present the first formal proof of a phenomenon empirically observed in large language models: performance degradation in transformers with extended contexts. We rigorously demonstrate that transformers inherently have limited memory, constraining the amount of information they can retain, regardless of the context size. This finding offers a fundamental understanding of the intrinsic limitations of transformer architectures, particularly their ability to handle long sequences.
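For readers less familiar with the setting, here is a minimal sketch of the soft-prompt-tuning setup the memorization bounds concern; the class and variable names are illustrative, not taken from the paper. The structural point is visible directly: the only trainable object is an m x d prompt matrix, which is the intuition behind a bound on memorized information that grows at most linearly in the prompt length m (the paper proves a precise version of this).

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Illustrative soft prompt: m trainable vectors prepended to frozen input embeddings."""

    def __init__(self, prompt_len: int, embed_dim: int):
        super().__init__()
        # The ONLY trainable parameters in prompt tuning: an (m x d) matrix.
        self.prompt = nn.Parameter(0.02 * torch.randn(prompt_len, embed_dim))

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d), produced by the frozen model's embedding layer
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)

m, d = 16, 64                                 # illustrative prompt length and width
soft_prompt = SoftPrompt(m, d)
x = torch.randn(2, 10, d)                     # stand-in for embeddings of the task input
print(soft_prompt(x).shape)                   # torch.Size([2, 26, 64]): prompt prepended
print(sum(p.numel() for p in soft_prompt.parameters()))  # m * d = 1024 trainable values
```

Everything else (the transformer weights) stays fixed, which is why the paper's second result, a memory ceiling that extra context cannot lift, concerns the frozen architecture itself rather than the tuning procedure.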
Related papers
- Quantitative Bounds for Length Generalization in Transformers [58.175107357008876]
We study the problem of length generalization (LG) in transformers. LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences.
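To make the notion concrete, here is a toy version of the evaluation protocol usually meant by length generalization; the task, length cutoffs, and sample counts below are illustrative and not taken from the paper. A model is fit only on short sequences and then probed on strictly longer ones, asking whether its behavior there matches, i.e. is "simulated by", its short-length behavior.

```python
import random

def copy_example(length: int) -> tuple[list[int], list[int]]:
    """Toy copy task: the target is simply the input sequence echoed back."""
    seq = [random.randint(0, 9) for _ in range(length)]
    return seq, list(seq)

# Train only on short sequences, test on strictly longer, never-seen lengths.
train_lengths = range(1, 33)        # illustrative training length cap
test_lengths = (64, 128, 256)       # out-of-distribution lengths used to probe generalization

train_set = [copy_example(l) for l in train_lengths for _ in range(100)]
test_set = [copy_example(l) for l in test_lengths for _ in range(100)]
print(len(train_set), len(test_set))   # 3200 300
```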
arXiv Detail & Related papers (2025-10-30T21:31:36Z)
- Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks. We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency in sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z)
- Characterizing the Expressivity of Transformer Language Models [56.598551673153366]
We provide an exact characterization of fixed-precision transformers with strict future masking and soft attention. We show that these models are precisely as expressive as a specific fragment of linear temporal logic. We further relate this logic to established classes in formal language theory, automata theory, and algebra.
arXiv Detail & Related papers (2025-05-29T16:30:30Z)
- Born a Transformer -- Always a Transformer? On the Effect of Pretraining on Architectural Abilities [58.742178800799614]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
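As a concrete illustration of the two directions, consider the generic induction-head-style construction below; it is not the paper's exact prompt format, which is only said to be inspired by Liu et al. In the induction case the answer token sits to the right of an earlier occurrence of the query, in the anti-induction case to its left.

```python
import random

VOCAB = tuple("abcdefghij")

def retrieval_prompt(direction: str, n_filler: int = 20) -> tuple[str, str]:
    """Build a toy prompt with one (key, value) pair and the key repeated as a query.

    direction="induction":      ... key value ... key  -> answer is value (RIGHT of the key)
    direction="anti-induction": ... value key ... key  -> answer is value (LEFT of the key)
    """
    key, value = random.sample(VOCAB, 2)
    filler = [random.choice([c for c in VOCAB if c != key]) for _ in range(n_filler)]
    pair = [key, value] if direction == "induction" else [value, key]
    pos = random.randrange(n_filler + 1)
    tokens = filler[:pos] + pair + filler[pos:] + [key]   # query key appended at the end
    return " ".join(tokens), value

print(retrieval_prompt("induction"))
print(retrieval_prompt("anti-induction"))
```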
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
- Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning [9.730604030100318]
Large Language Models struggle with generalisation beyond their training distribution. Information Bottleneck (IB) theory posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We show that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache.
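Reading only the abstract, one plausible shape for such a module is sketched below: every fixed number of decoding steps, a learned block rewrites the entire cached key/value state, with each cached position updated as a function of all others. This is a guess for illustration purposes; the paper's actual module, its placement, and its training objective may differ.

```python
import torch
import torch.nn as nn

class PeriodicCacheRewriter(nn.Module):
    """Hypothetical sketch of a module that periodically rewrites the whole KV cache.

    Every `period` decoding steps, cached keys and values are passed through a
    self-attention block, so each cached position is rewritten as a function of the
    entire cache. Illustrative only; not the paper's actual module.
    """

    def __init__(self, head_dim: int, period: int = 8):
        super().__init__()
        self.period = period
        self.mix = nn.MultiheadAttention(head_dim, num_heads=1, batch_first=True)

    def maybe_rewrite(self, step: int, keys: torch.Tensor, values: torch.Tensor):
        # keys, values: (batch, cached_len, head_dim)
        if step > 0 and step % self.period == 0:
            keys, _ = self.mix(keys, keys, keys)          # global rewrite of cached keys
            values, _ = self.mix(values, values, values)  # global rewrite of cached values
        return keys, values

rewriter = PeriodicCacheRewriter(head_dim=64)
cache_k, cache_v = torch.randn(1, 32, 64), torch.randn(1, 32, 64)
cache_k, cache_v = rewriter.maybe_rewrite(step=8, keys=cache_k, values=cache_v)
print(cache_k.shape, cache_v.shape)   # shapes unchanged, contents rewritten at the period boundary
```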
arXiv Detail & Related papers (2025-05-22T17:33:49Z)
- Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights [47.62295798627317]
This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold. We prove approximation and generalization errors which crucially depend on the intrinsic dimension of the manifold. Our results demonstrate that transformers can leverage low-complexity structures in learning tasks even when the input data are perturbed by high-dimensional noise.
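For orientation, classical nonparametric results on a $d$-dimensional manifold embedded in $\mathbb{R}^D$ give estimation rates for $s$-smooth targets of the schematic form
$$\mathbb{E}\,\lVert \hat f_n - f \rVert^2 \;\lesssim\; n^{-\frac{2s}{2s+d}},$$
that is, rates governed by the intrinsic dimension $d$ rather than the ambient dimension $D$. The abstract's approximation and generalization bounds are of this flavor; the exact exponents, norms, and noise model are specific to the paper and not reproduced here.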
arXiv Detail & Related papers (2025-05-06T05:41:46Z)
- Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge. We propose TEGA, a logic-aware architecture that significantly improves performance on first-order logical entailment.
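For a sense of the task, a generic textbook instance (not one drawn from the paper's benchmark): from $\forall x\,(\mathrm{Cat}(x) \rightarrow \mathrm{Mammal}(x))$ and $\mathrm{Cat}(\mathrm{tom})$, the sentence $\mathrm{Mammal}(\mathrm{tom})$ is entailed; from the same rule together with $\mathrm{Mammal}(\mathrm{tom})$ alone, $\mathrm{Cat}(\mathrm{tom})$ is not.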
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
- A Formal Framework for Understanding Length Generalization in Transformers [14.15513446489798]
We introduce a rigorous theoretical framework to analyze length generalization in causal transformers. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
arXiv Detail & Related papers (2024-10-03T01:52:01Z)
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer does not always lead to enhanced performance. We present a theoretical framework that sheds light on memorization during the pre-training of transformer-based language models.
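The associative-memory view of attention that this line of work draws on can be written in a few lines; the snippet below is the standard softmax key-value retrieval picture, not the paper's specific framework or notation.

```python
import numpy as np

def associative_retrieve(query, keys, values, beta=8.0):
    """Soft associative lookup: retrieve a stored value by key similarity, attention-style."""
    scores = beta * (keys @ query)            # similarity of the probe to each stored key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over stored patterns
    return weights @ values                   # convex combination of stored values

rng = np.random.default_rng(0)
keys = rng.standard_normal((5, 8))              # 5 stored (key, value) pairs in dimension 8
values = rng.standard_normal((5, 8))
query = keys[2] + 0.1 * rng.standard_normal(8)  # a noisy probe of the third stored key
out = associative_retrieve(query, keys, values)
print(np.linalg.norm(out - values[2]))          # typically small: the probe snaps to values[2]
```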
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
- Universality and Limitations of Prompt Tuning [65.8354898840308]
We take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures.
We analyze prompt tuning through the lens of universality and limitations with finite-depth pretrained transformers for continuous-valued functions.
Our result guarantees the existence of a transformer which, given a suitable prompt, can approximate any sequence-to-sequence function in the set of Lipschitz functions.
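Stated schematically (a paraphrase of the claim, not the paper's theorem verbatim): there is a fixed pretrained transformer $\tau$ such that for every Lipschitz sequence-to-sequence target $f$ and tolerance $\varepsilon > 0$ one can choose a prompt $P$ with
$$\sup_{X} \big\lVert \tau([P; X]) - f(X) \big\rVert \le \varepsilon,$$
where the supremum is over an appropriate bounded input domain; all adaptation happens through $P$ while the weights of $\tau$ remain frozen.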
arXiv Detail & Related papers (2023-05-30T06:47:07Z)