On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
- URL: http://arxiv.org/abs/2410.01405v2
- Date: Tue, 8 Oct 2024 16:41:40 GMT
- Title: On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
- Authors: Kevin Xu, Issei Sato
- Abstract summary: Looped Transformers offer advantages in parameter efficiency and Turing completeness.
We establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions.
- Score: 32.01426831450348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Looped Transformers offer advantages in parameter efficiency and Turing completeness. However, their expressive power for function approximation and approximation rate remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture, which prompts us to incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.
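The timestep-conditioned scaling described in the abstract can be pictured with a minimal PyTorch sketch. This is not the authors' implementation: the class name `LoopedBlock`, the per-loop embedding table, and the scaled-residual update are illustrative assumptions about how per-loop scaling conditioned on a timestep encoding might be wired into a weight-shared block.

```python
# Hedged sketch (not the authors' code): a weight-shared Transformer block is
# applied n_loops times, and each loop modulates its update with a scaling
# vector computed from a learned timestep embedding. The exact conditioning
# scheme in the paper may differ; LoopedBlock and to_scale are illustrative.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_loops=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.timestep_emb = nn.Embedding(n_loops, d_model)  # one code per loop
        self.to_scale = nn.Linear(d_model, d_model)          # timestep -> scales
        self.n_loops = n_loops

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        for t in range(self.n_loops):
            step = self.timestep_emb(torch.tensor(t, device=x.device))
            scale = self.to_scale(step)            # per-channel scaling, (d_model,)
            x = x + scale * self.block(x)          # scaled residual update
        return x

model = LoopedBlock()
y = model(torch.randn(2, 16, 64))                  # same weights reused every loop
```

The point of the sketch is that capacity comes from reusing the same block weights across loops, while the timestep-conditioned scales allow each iteration to behave slightly differently.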
Related papers
- Fourier Series Guided Design of Quantum Convolutional Neural Networks for Enhanced Time Series Forecasting [0.9060149007362646]
We apply 1D quantum convolution to address the task of time series forecasting.
By encoding multiple points into the quantum circuit to predict subsequent data, each point becomes a feature, transforming the problem into a multidimensional one.
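As a rough illustration of the windowing step implied by the summary above, the sketch below (plain NumPy, no quantum circuit) shows how consecutive points become a feature vector for predicting the next value; the helper `make_windows` is hypothetical and not taken from the paper.

```python
# Hedged sketch of only the classical windowing step: each of the `window`
# past points becomes one input feature for predicting the next value.
# The quantum convolution circuit itself is not reproduced here.
import numpy as np

def make_windows(series, window=4):
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y                      # X: (n_samples, window), y: (n_samples,)

series = np.sin(np.linspace(0, 10, 200))
X, y = make_windows(series)
print(X.shape, y.shape)              # (196, 4) (196,)
```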
arXiv Detail & Related papers (2024-04-23T01:35:07Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
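A minimal single-device sketch of the blockwise idea (not the paper's multi-device ring implementation): keys and values are streamed in blocks with an online softmax, so the full attention matrix is never materialized. The function name `blockwise_attention` and the block size are illustrative.

```python
# Hedged sketch: blockwise attention with an online softmax. The distributed
# ring communication from the paper is not reproduced here.
import numpy as np

def blockwise_attention(q, k, v, block=128):
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)            # running row-wise max
    den = np.zeros(q.shape[0])                  # running softmax denominator
    num = np.zeros_like(q)                      # running weighted sum of values
    for s in range(0, k.shape[0], block):
        scores = q @ k[s:s + block].T / np.sqrt(d)   # (n_q, block)
        m_new = np.maximum(m, scores.max(axis=1))
        correction = np.exp(m - m_new)          # rescale previous partial sums
        p = np.exp(scores - m_new[:, None])
        num = num * correction[:, None] + p @ v[s:s + block]
        den = den * correction + p.sum(axis=1)
        m = m_new
    return num / den[:, None]

q = k = v = np.random.randn(512, 64)
out = blockwise_attention(q, k, v)               # equals full softmax attention
```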
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
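A minimal sketch of the chunking step only, assuming a generic frozen encoder; the inter-chunk alignment and the learned selection policy from the paper are not reproduced, and the helper `encode_in_chunks` is hypothetical.

```python
# Hedged sketch: split a long embedded sequence into fixed-size chunks and
# encode the chunks as a batch with an off-the-shelf encoder.
import torch
import torch.nn as nn

def encode_in_chunks(token_embeddings, encoder, chunk_len=128):
    # token_embeddings: (seq_len, d_model); zero-pad so seq_len divides evenly.
    seq_len, d_model = token_embeddings.shape
    pad = (-seq_len) % chunk_len
    x = torch.cat([token_embeddings, token_embeddings.new_zeros(pad, d_model)])
    chunks = x.view(-1, chunk_len, d_model)          # (n_chunks, chunk_len, d)
    return encoder(chunks)                           # encoded chunk by chunk

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
out = encode_in_chunks(torch.randn(1000, 64), encoder)   # (8, 128, 64)
```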
arXiv Detail & Related papers (2023-08-25T05:52:05Z)
- Sumformer: Universal Approximation for Efficient Transformers [2.4832703558223725]
We introduce Sumformer, a novel and simple architecture capable of universally approximating sequence-to-sequence functions.
We derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
arXiv Detail & Related papers (2023-07-05T13:59:35Z) - Universality and Limitations of Prompt Tuning [65.8354898840308]
We take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures.
We analyze prompt tuning from the lens of universality and limitations with finite-depth pretrained transformers for continuous-valued functions.
Our result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions.
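For context, a minimal sketch of generic soft-prompt tuning (not the paper's theoretical construction): learnable prompt vectors are prepended to the input while the pretrained backbone stays frozen. The class `SoftPrompt` and its dimensions are illustrative.

```python
# Hedged sketch: prepend trainable prompt embeddings to a frozen transformer.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, frozen_model, d_model=64, prompt_len=10):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad = False          # only the prompt is trained

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        prompt = self.prompt.expand(x.shape[0], -1, -1)
        return self.model(torch.cat([prompt, x], dim=1))

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
tuned = SoftPrompt(backbone)
out = tuned(torch.randn(2, 16, 64))          # (2, 26, 64): prompt tokens prepended
```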
arXiv Detail & Related papers (2023-05-30T06:47:07Z)
- Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input [50.83356836818667]
We study the approximation and estimation ability of Transformers as sequence-to-sequence functions with infinite dimensional inputs.
Our theoretical results support the practical success of Transformers for high dimensional data.
arXiv Detail & Related papers (2023-05-30T02:44:49Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
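A minimal sketch of the generic building block behind such compression, assuming a truncated-SVD factorization of a single linear layer; HEAT's hardware-aware search over decomposition choices is not reproduced, and `factorize_linear` is a hypothetical helper.

```python
# Hedged sketch: replace one dense linear layer with a rank-r factorization,
# trading a small approximation error for fewer parameters and MACs.
import torch
import torch.nn as nn

def factorize_linear(layer, rank):
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features, bias=True)
    A.weight.data = torch.diag(S[:rank].sqrt()) @ Vh[:rank]      # (rank, in)
    B.weight.data = U[:, :rank] @ torch.diag(S[:rank].sqrt())    # (out, rank)
    B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)       # ~ (in*rank + rank*out) params vs in*out

dense = nn.Linear(768, 3072)
compressed = factorize_linear(dense, rank=128)
x = torch.randn(4, 768)
err = (dense(x) - compressed(x)).abs().max()   # error introduced by truncation
```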
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variational hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate-Scale Quantum (NISQ) devices.
We propose a strategy for the ansatze used in variational quantum algorithms, which we call "Parameter-Efficient Circuit Training" (PECT).
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z)
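A classical stand-in for the sequential, subset-wise optimization described in the PECT entry above: a toy cost is minimized one parameter block at a time instead of all at once. No quantum circuit is simulated; the cost function, block split, and step size are purely illustrative.

```python
# Hedged sketch: block-coordinate optimization as a classical analogue of
# launching a sequence of smaller variational problems.
import numpy as np

def cost(theta):                               # placeholder for an ansatz cost
    return np.sum(np.sin(theta) ** 2 + 0.1 * theta ** 2)

def grad(theta, eps=1e-5):                     # finite-difference gradient
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (cost(theta + d) - cost(theta - d)) / (2 * eps)
    return g

theta = np.random.randn(12)
for block in np.array_split(np.arange(theta.size), 4):   # sequence of sub-problems
    for _ in range(100):                       # update only the active block
        theta[block] -= 0.1 * grad(theta)[block]
print(cost(theta))
```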
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.