On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
- URL: http://arxiv.org/abs/2410.01405v4
- Date: Mon, 03 Feb 2025 07:30:21 GMT
- Title: On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
- Authors: Kevin Xu, Issei Sato,
- Abstract summary: We establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions.
Experiments validate the theoretical results, showing that increasing the number of loops enhances performance.
- Score: 32.01426831450348
- License:
- Abstract: Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture. That is, the analysis prompts the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding.
Related papers
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z) - Sumformer: Universal Approximation for Efficient Transformers [2.4832703558223725]
We introduce Sumformer, a novel and simple architecture capable of universally approxingimating sequence-to-sequence functions.
We derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
arXiv Detail & Related papers (2023-07-05T13:59:35Z) - Universality and Limitations of Prompt Tuning [65.8354898840308]
We take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures.
We analyze prompt tuning from the lens of universality and limitations with finite-depth pretrained transformers for continuous-valued functions.
Our result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions.
arXiv Detail & Related papers (2023-05-30T06:47:07Z) - Approximation and Estimation Ability of Transformers for
Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - The prediction of the quality of results in Logic Synthesis using
Transformer and Graph Neural Networks [7.194397166633194]
We propose a deep learning method to predict the quality of circuit-optimization sequences pairs.
The Transformer and three typical GNNs are used as a joint learning policy for the QoR prediction of the unseen circuit-optimization sequences.
The experimental results show that the joint learning of Transformer and GraphSage gives the best results.
arXiv Detail & Related papers (2022-07-23T06:53:57Z) - Performance Evaluations of Noisy Approximate Quantum Fourier Arithmetic [1.1140384738063092]
We implement QFT-based integer addition and multiplications on quantum computers.
These operations are fundamental to various quantum applications.
We evaluate these implementations based on IBM's superconducting qubit architecture.
arXiv Detail & Related papers (2021-12-17T06:51:18Z) - Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variisy hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate Scale Quantum devices.
We propose a strategy for such ansatze used in variational quantum algorithms, which we call "Efficient Circuit Training" (PECT)
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.