On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
- URL: http://arxiv.org/abs/2410.01405v2
- Date: Tue, 8 Oct 2024 16:41:40 GMT
- Title: On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
- Authors: Kevin Xu, Issei Sato
- Abstract summary: Looped Transformers offer advantages in parameter efficiency and Turing completeness.
We establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions.
- Score: 32.01426831450348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Looped Transformers offer advantages in parameter efficiency and Turing completeness. However, their expressive power for function approximation and approximation rate remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture, which prompts us to incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.
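The timestep-conditioned scaling described in the abstract can be pictured with a minimal PyTorch sketch. This is not the authors' implementation: the class name `LoopedBlock`, the per-loop embedding table, and the scaled-residual update are illustrative assumptions about how per-loop scaling conditioned on a timestep encoding might be wired into a weight-shared block.

```python
# Hedged sketch (not the authors' code): a weight-shared Transformer block is
# applied n_loops times, and each loop modulates its update with a scaling
# vector computed from a learned timestep embedding. The exact conditioning
# scheme in the paper may differ; LoopedBlock and to_scale are illustrative.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_loops=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.timestep_emb = nn.Embedding(n_loops, d_model)  # one code per loop
        self.to_scale = nn.Linear(d_model, d_model)          # timestep -> scales
        self.n_loops = n_loops

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        for t in range(self.n_loops):
            step = self.timestep_emb(torch.tensor(t, device=x.device))
            scale = self.to_scale(step)            # per-channel scaling, (d_model,)
            x = x + scale * self.block(x)          # scaled residual update
        return x

model = LoopedBlock()
y = model(torch.randn(2, 16, 64))                  # same weights reused every loop
```

The point of the sketch is that capacity comes from reusing the same block weights across loops, while the timestep-conditioned scales allow each iteration to behave slightly differently.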
Related papers
- Fourier Series Guided Design of Quantum Convolutional Neural Networks for Enhanced Time Series Forecasting [0.9060149007362646]
We apply 1D quantum convolution to address the task of time series forecasting.
By encoding multiple points into the quantum circuit to predict subsequent data, each point becomes a feature, transforming the problem into a multidimensional one.
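As a rough illustration of the windowing step implied by the summary above, the sketch below (plain NumPy, no quantum circuit) shows how consecutive points become a feature vector for predicting the next value; the helper `make_windows` is hypothetical and not taken from the paper.

```python
# Hedged sketch of only the classical windowing step: each of the `window`
# past points becomes one input feature for predicting the next value.
# The quantum convolution circuit itself is not reproduced here.
import numpy as np

def make_windows(series, window=4):
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y                      # X: (n_samples, window), y: (n_samples,)

series = np.sin(np.linspace(0, 10, 200))
X, y = make_windows(series)
print(X.shape, y.shape)              # (196, 4) (196,)
```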
arXiv Detail & Related papers (2024-04-23T01:35:07Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
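A minimal single-device sketch of the blockwise idea (not the paper's multi-device ring implementation): keys and values are streamed in blocks with an online softmax, so the full attention matrix is never materialized. The function name `blockwise_attention` and the block size are illustrative.

```python
# Hedged sketch: blockwise attention with an online softmax. The distributed
# ring communication from the paper is not reproduced here.
import numpy as np

def blockwise_attention(q, k, v, block=128):
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)            # running row-wise max
    den = np.zeros(q.shape[0])                  # running softmax denominator
    num = np.zeros_like(q)                      # running weighted sum of values
    for s in range(0, k.shape[0], block):
        scores = q @ k[s:s + block].T / np.sqrt(d)   # (n_q, block)
        m_new = np.maximum(m, scores.max(axis=1))
        correction = np.exp(m - m_new)          # rescale previous partial sums
        p = np.exp(scores - m_new[:, None])
        num = num * correction[:, None] + p @ v[s:s + block]
        den = den * correction + p.sum(axis=1)
        m = m_new
    return num / den[:, None]

q = k = v = np.random.randn(512, 64)
out = blockwise_attention(q, k, v)               # equals full softmax attention
```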
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
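A minimal sketch of the chunking step only, assuming a generic frozen encoder; the inter-chunk alignment and the learned selection policy from the paper are not reproduced, and the helper `encode_in_chunks` is hypothetical.

```python
# Hedged sketch: split a long embedded sequence into fixed-size chunks and
# encode the chunks as a batch with an off-the-shelf encoder.
import torch
import torch.nn as nn

def encode_in_chunks(token_embeddings, encoder, chunk_len=128):
    # token_embeddings: (seq_len, d_model); zero-pad so seq_len divides evenly.
    seq_len, d_model = token_embeddings.shape
    pad = (-seq_len) % chunk_len
    x = torch.cat([token_embeddings, token_embeddings.new_zeros(pad, d_model)])
    chunks = x.view(-1, chunk_len, d_model)          # (n_chunks, chunk_len, d)
    return encoder(chunks)                           # encoded chunk by chunk

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
out = encode_in_chunks(torch.randn(1000, 64), encoder)   # (8, 128, 64)
```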
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- Sumformer: Universal Approximation for Efficient Transformers [2.4832703558223725]
We introduce Sumformer, a novel and simple architecture capable of universally approximating sequence-to-sequence functions.
We derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
arXiv Detail & Related papers (2023-07-05T13:59:35Z) - Universality and Limitations of Prompt Tuning [65.8354898840308]
We take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures.
We analyze prompt tuning from the lens of universality and limitations with finite-depth pretrained transformers for continuous-valued functions.
Our result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions.
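For context, a minimal sketch of generic soft-prompt tuning (not the paper's theoretical construction): learnable prompt vectors are prepended to the input while the pretrained backbone stays frozen. The class `SoftPrompt` and its dimensions are illustrative.

```python
# Hedged sketch: prepend trainable prompt embeddings to a frozen transformer.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, frozen_model, d_model=64, prompt_len=10):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad = False          # only the prompt is trained

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        prompt = self.prompt.expand(x.shape[0], -1, -1)
        return self.model(torch.cat([prompt, x], dim=1))

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
tuned = SoftPrompt(backbone)
out = tuned(torch.randn(2, 16, 64))          # (2, 26, 64): prompt tokens prepended
```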
arXiv Detail & Related papers (2023-05-30T06:47:07Z)
- Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
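A minimal sketch of the generic building block behind such compression, assuming a truncated-SVD factorization of a single linear layer; HEAT's hardware-aware search over decomposition choices is not reproduced, and `factorize_linear` is a hypothetical helper.

```python
# Hedged sketch: replace one dense linear layer with a rank-r factorization,
# trading a small approximation error for fewer parameters and MACs.
import torch
import torch.nn as nn

def factorize_linear(layer, rank):
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=True)
    A.weight.data = torch.diag(S[:rank].sqrt()) @ Vh[:rank]      # (rank, in)
    B.weight.data = U[:, :rank] @ torch.diag(S[:rank].sqrt())    # (out, rank)
    B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)       # ~ (in*rank + rank*out) params vs in*out

dense = nn.Linear(768, 3072)
compressed = factorize_linear(dense, rank=128)
x = torch.randn(4, 768)
err = (dense(x) - compressed(x)).abs().max()   # error introduced by truncation
```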
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variational hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate-Scale Quantum (NISQ) devices.
We propose a strategy for the ansatze used in variational quantum algorithms, which we call "Parameter-Efficient Circuit Training" (PECT).
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
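A classical stand-in for the sequential, subset-wise optimization described in the PECT entry above: a toy cost is minimized one parameter block at a time instead of all at once. No quantum circuit is simulated; the cost function, block split, and step size are purely illustrative.

```python
# Hedged sketch: block-coordinate optimization as a classical analogue of
# launching a sequence of smaller variational problems.
import numpy as np

def cost(theta):                               # placeholder for an ansatz cost
    return np.sum(np.sin(theta) ** 2 + 0.1 * theta ** 2)

def grad(theta, eps=1e-5):                     # finite-difference gradient
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (cost(theta + d) - cost(theta - d)) / (2 * eps)
    return g

theta = np.random.randn(12)
for block in np.array_split(np.arange(theta.size), 4):   # sequence of sub-problems
    for _ in range(100):                       # update only the active block
        theta[block] -= 0.1 * grad(theta)[block]
print(cost(theta))
```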
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.