Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
- URL: http://arxiv.org/abs/2406.06893v1
- Date: Tue, 11 Jun 2024 02:15:53 GMT
- Title: Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
- Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee
- Abstract summary: The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
- Score: 50.16171384920963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an average-case setting and establish an algorithmic separation of transformers over FCNs. Specifically, a one-layer transformer trained with gradient descent provably learns the sparse token selection task and, surprisingly, exhibits strong out-of-distribution length generalization. We provide empirical simulations to justify our theoretical findings.
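For intuition, below is a minimal sketch of the sparse token selection task as described in the abstract (following my reading of Sanford et al.'s formulation): each input is a sequence of token vectors together with a small set of q selected positions, and the target is the average of the tokens at those positions. It also illustrates, representationally, why one softmax-attention readout suffices: a query that concentrates attention weight on the selected positions recovers their average. This is not the authors' construction or training procedure; the names (N, d, q, make_example, attention_readout) and the hand-set attention scores are illustrative assumptions.

```python
# Illustrative sketch of the sparse token selection task (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
N, d, q = 16, 8, 3  # sequence length, token dimension, sparsity level (assumed values)

def make_example():
    X = rng.standard_normal((N, d))           # token vectors x_1, ..., x_N
    S = rng.choice(N, size=q, replace=False)  # the q selected positions
    y = X[S].mean(axis=0)                     # target: average of the selected tokens
    return X, S, y

def attention_readout(X, S, beta=20.0):
    # A single softmax-attention head can approximate the target: scores are set
    # high (+beta) on the selected positions and low (-beta) elsewhere, so the
    # softmax puts near-uniform weight on S and near-zero weight off S.
    scores = np.full(N, -beta)
    scores[S] = beta
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ X                              # weighted average of the token vectors

X, S, y = make_example()
print(np.abs(attention_readout(X, S) - y).max())  # approximation error is tiny
```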
Related papers
- Adaptive Pruning of Pretrained Transformer via Differential Inclusions [48.47890215458465]
Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio.
We propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter.
This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure.
arXiv Detail & Related papers (2025-01-06T06:34:52Z)
- Transformers Simulate MLE for Sequence Generation in Bayesian Networks [18.869174453242383]
We investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE).
We demonstrate that there exists a simple transformer model that can estimate the conditional probabilities of the Bayesian network according to the context.
We further demonstrate through extensive experiments that such a transformer not only exists in theory, but can also be obtained effectively through training.
arXiv Detail & Related papers (2025-01-05T13:56:51Z)
- Equivariant Neural Functional Networks for Transformers [2.3963215252605172]
This paper systematically explores neural functional networks (NFN) for transformer architectures.
NFNs are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data.
arXiv Detail & Related papers (2024-10-05T15:56:57Z)
- Transformer Neural Autoregressive Flows [48.68932811531102]
Density estimation can be performed using Normalizing Flows (NFs).
We propose a novel approach that exploits transformers to define a new class of neural flows called Transformer Neural Autoregressive Flows (T-NAFs).
arXiv Detail & Related papers (2024-01-03T17:51:16Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform in-context learning (ICL).
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.