Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
- URL: http://arxiv.org/abs/2406.06893v1
- Date: Tue, 11 Jun 2024 02:15:53 GMT
- Title: Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
- Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee
- Abstract summary: The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
- Score: 50.16171384920963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an average-case setting and establish an algorithmic separation of transformers over FCNs. Specifically, a one-layer transformer trained with gradient descent provably learns the sparse token selection task and, surprisingly, exhibits strong out-of-distribution length generalization. We provide empirical simulations to justify our theoretical findings.
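For intuition, below is a minimal sketch of the sparse token selection task as described in the abstract (following my reading of Sanford et al.'s formulation): each input is a sequence of token vectors together with a small set of q selected positions, and the target is the average of the tokens at those positions. It also illustrates, representationally, why one softmax-attention readout suffices: a query that concentrates attention weight on the selected positions recovers their average. This is not the authors' construction or training procedure; the names (N, d, q, make_example, attention_readout) and the hand-set attention scores are illustrative assumptions.

```python
# Illustrative sketch of the sparse token selection task (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
N, d, q = 16, 8, 3  # sequence length, token dimension, sparsity level (assumed values)

def make_example():
    X = rng.standard_normal((N, d))           # token vectors x_1, ..., x_N
    S = rng.choice(N, size=q, replace=False)  # the q selected positions
    y = X[S].mean(axis=0)                     # target: average of the selected tokens
    return X, S, y

def attention_readout(X, S, beta=20.0):
    # A single softmax-attention head can approximate the target: scores are set
    # high (+beta) on the selected positions and low (-beta) elsewhere, so the
    # softmax puts near-uniform weight on S and near-zero weight off S.
    scores = np.full(N, -beta)
    scores[S] = beta
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ X                              # weighted average of the token vectors

X, S, y = make_example()
print(np.abs(attention_readout(X, S) - y).max())  # approximation error is tiny
```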
Related papers
- Adaptive Pruning of Pretrained Transformer via Differential Inclusions [48.47890215458465]
Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio.
We propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter.
This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure.
arXiv Detail & Related papers (2025-01-06T06:34:52Z)
- Transformers Simulate MLE for Sequence Generation in Bayesian Networks [18.869174453242383]
We investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE).
We demonstrate that there exists a simple transformer model that can estimate the conditional probabilities of the Bayesian network according to the context.
We further demonstrate through extensive experiments that such a transformer not only exists in theory, but can also be obtained effectively through training.
arXiv Detail & Related papers (2025-01-05T13:56:51Z)
- Equivariant Neural Functional Networks for Transformers [2.3963215252605172]
This paper systematically explores neural functional networks (NFN) for transformer architectures.
NFNs are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data.
arXiv Detail & Related papers (2024-10-05T15:56:57Z)
- Transformer Neural Autoregressive Flows [48.68932811531102]
Density estimation can be performed using Normalizing Flows (NFs).
We propose a novel approach that exploits transformers to define a new class of neural flows called Transformer Neural Autoregressive Flows (T-NAFs).
arXiv Detail & Related papers (2024-01-03T17:51:16Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform in-context learning (ICL).
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.