Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
- URL: http://arxiv.org/abs/2406.06893v1
- Date: Tue, 11 Jun 2024 02:15:53 GMT
- Title: Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
- Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee
- Abstract summary: The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
- Score: 50.16171384920963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an average-case setting and establish an algorithmic separation of transformers over FCNs. Specifically, a one-layer transformer trained with gradient descent provably learns the sparse token selection task and, surprisingly, exhibits strong out-of-distribution length generalization. We provide empirical simulations to justify our theoretical findings.
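As a rough illustration of the setup the abstract describes (the task is not restated above, so the exact formulation below is an assumption based on Sanford et al.): each input is a sequence of tokens together with a subset S of q positions encoded in a query, and the target is the average of the tokens indexed by S. The sketch trains a toy one-layer, single-head attention model on that task with gradient descent; the dimensions, optimizer, and one-hot positional encoding are arbitrary choices, not the paper's construction.

```python
# Hedged sketch: a one-layer attention model trained by gradient descent on an
# assumed version of the sparse token selection task (predict the mean of the
# tokens indexed by a subset S that is encoded in the query).
import torch

torch.manual_seed(0)
N, d, q = 16, 8, 3                                   # sequence length, token dim, subset size

def sample_batch(B):
    X = torch.randn(B, N, d)                                          # tokens
    S = torch.stack([torch.randperm(N)[:q] for _ in range(B)])        # selected positions
    subset = torch.zeros(B, N).scatter_(1, S, 1.0)                    # multi-hot encoding of S
    y = (subset.unsqueeze(-1) * X).sum(1) / q                         # target: mean of selected tokens
    return X, subset, y

class OneLayerAttention(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.Wq = torch.nn.Linear(N, 32, bias=False)  # query from the subset encoding
        self.Wk = torch.nn.Linear(N, 32, bias=False)  # keys from one-hot positions

    def forward(self, X, subset):
        B = X.shape[0]
        pos = torch.eye(N).expand(B, N, N)                                   # one-hot position per token
        qvec = self.Wq(subset).unsqueeze(1)                                  # (B, 1, 32)
        k = self.Wk(pos)                                                     # (B, N, 32)
        attn = torch.softmax(qvec @ k.transpose(1, 2) / 32 ** 0.5, dim=-1)   # (B, 1, N)
        return (attn @ X).squeeze(1)                                         # attention-weighted average

model = OneLayerAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    X, subset, y = sample_batch(64)
    loss = ((model(X, subset) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```

Because the positions here are one-hot vectors of a fixed length N, this toy setup cannot by itself probe the out-of-distribution length generalization the paper establishes.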
Related papers
- Equivariant Neural Functional Networks for Transformers [2.3963215252605172]
This paper systematically explores neural functional networks (NFN) for transformer architectures.
NFNs are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data; a toy illustration of this input format follows the entry.
arXiv Detail & Related papers (2024-10-05T15:56:57Z)
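A toy illustration of the "weights as data" input format mentioned above, assuming nothing about the paper's actual design: one network's parameters are flattened and fed to another network. This is a plain MLP stand-in, not the permutation-equivariant NFN architecture the paper develops.

```python
# Minimal "weights as data" sketch: a stand-in for an NFN, not the paper's model.
import torch

def flatten_weights(net):
    # Concatenate all parameters of a network into a single feature vector.
    return torch.cat([p.detach().reshape(-1) for p in net.parameters()])

base = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
w = flatten_weights(base)                            # the data point: base's weights
functional_net = torch.nn.Sequential(                # a plain MLP consuming those weights
    torch.nn.Linear(w.numel(), 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
print(functional_net(w))                             # e.g. predict some property of `base`
```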
- Transformer Neural Autoregressive Flows [48.68932811531102]
Density estimation can be performed using Normalizing Flows (NFs).
We propose a novel solution that exploits transformers to define a new class of neural flows, called Transformer Neural Autoregressive Flows (T-NAFs); a generic flow sketch follows this entry.
arXiv Detail & Related papers (2024-01-03T17:51:16Z)
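For readers unfamiliar with the density-estimation setup mentioned above, here is a minimal, generic affine autoregressive flow built on the change-of-variables formula log p(x) = log p_base(f(x)) + log|det df/dx|. It is not the transformer-parameterized T-NAF; the masking and parameterization are illustrative assumptions.

```python
# Generic affine autoregressive flow (illustrative only, not T-NAF).
import torch

d = 3
base = torch.distributions.Normal(0.0, 1.0)               # base density
mask = torch.tril(torch.ones(d, d), diagonal=-1)          # strictly lower-triangular mask
W_mu = torch.nn.Parameter(torch.randn(d, d) * 0.1)        # shift conditioner (assumed linear form)
W_s = torch.nn.Parameter(torch.randn(d, d) * 0.1)         # log-scale conditioner (assumed linear form)

def log_prob(x):
    # z_i = (x_i - mu_i(x_<i)) * exp(-s_i(x_<i)): triangular Jacobian, log|det| = -sum_i s_i.
    mu = x @ (mask * W_mu).T
    s = x @ (mask * W_s).T
    z = (x - mu) * torch.exp(-s)
    return base.log_prob(z).sum(-1) - s.sum(-1)            # change of variables

x = torch.randn(5, d)
print(log_prob(x))   # maximize over W_mu, W_s (e.g. with Adam) to fit a density
```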
- U-shaped Transformer: Retain High Frequency Context in Time Series Analysis [0.5710971447109949]
In this paper, we consider the low-pass characteristics of transformers and try to incorporate their advantages.
We introduce patch merge and split operations to extract features at different scales and use larger datasets to make full use of the transformer backbone; a rough sketch of such operations follows this entry.
Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.
arXiv Detail & Related papers (2023-07-18T07:15:26Z)
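One plausible reading of the patch merge and split operations mentioned above, sketched as plain reshapes on a (batch, length, channels) series; the paper's actual operators, and any learned projections around them, may differ.

```python
# Assumed interpretation of patch merge/split as reshapes; not the paper's exact operators.
import torch

def patch_merge(x, patch=2):
    # (B, L, C) -> (B, L // patch, C * patch): coarser time resolution, wider channels.
    B, L, C = x.shape
    return x.reshape(B, L // patch, patch * C)

def patch_split(x, patch=2):
    # Inverse of patch_merge: (B, L, C * patch) -> (B, L * patch, C).
    B, L, C = x.shape
    return x.reshape(B, L * patch, C // patch)

x = torch.randn(4, 96, 8)                                    # toy multivariate time series
merged = patch_merge(x)                                      # (4, 48, 16): a coarser-scale view
print(merged.shape, torch.allclose(patch_split(merged), x))  # split inverts merge
```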
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous for the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Content-Augmented Feature Pyramid Network with Light Linear Transformers [7.035864400598343]
Transformers can adaptively aggregate similar features from a global view using the self-attention mechanism.
For object detection, the Feature Pyramid Network (FPN) proposes feature interaction across layers and has proved extremely important.
In this paper, we utilize a linearized attention function to overcome the above problems and build a novel architecture, named the Content-Augmented Feature Pyramid Network (CA-FPN); a generic linear-attention sketch follows this entry.
arXiv Detail & Related papers (2021-05-20T02:31:31Z)
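A generic linearized attention sketch in the spirit of the mechanism named above: the quadratic softmax(QK^T)V is replaced by a positive feature map phi so that phi(Q)(phi(K)^T V) costs time linear in the sequence length. The phi(x) = elu(x) + 1 choice below is a common one and not necessarily CA-FPN's exact formulation.

```python
# Generic kernelized linear attention; a common construction, not CA-FPN's exact form.
import torch

def linear_attention(Q, K, V, eps=1e-6):
    # phi(Q) (phi(K)^T V) with phi = elu + 1, avoiding the (N x N) attention matrix.
    phi_q = torch.nn.functional.elu(Q) + 1                        # (B, N, d_k), positive features
    phi_k = torch.nn.functional.elu(K) + 1
    kv = phi_k.transpose(1, 2) @ V                                # (B, d_k, d_v), linear in N
    z = phi_q @ phi_k.sum(dim=1, keepdim=True).transpose(1, 2)    # per-query normalizer, (B, N, 1)
    return (phi_q @ kv) / (z + eps)

B, N, dk, dv = 2, 1024, 32, 32
Q, K, V = torch.randn(B, N, dk), torch.randn(B, N, dk), torch.randn(B, N, dv)
print(linear_attention(Q, K, V).shape)                            # torch.Size([2, 1024, 32])
```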
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)