p-Laplacian Transformer
- URL: http://arxiv.org/abs/2311.03235v1
- Date: Mon, 6 Nov 2023 16:25:56 GMT
- Title: p-Laplacian Transformer
- Authors: Tuan Nguyen, Tam Nguyen, Vinh Nguyen, Tan M. Nguyen
- Abstract summary: $p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT).
- Score: 7.2541371193810384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: $p$-Laplacian regularization, rooted in graph and image signal processing,
introduces a parameter $p$ to control the regularization effect on these data.
Smaller values of $p$ promote sparsity and interpretability, while larger
values encourage smoother solutions. In this paper, we first show that the
self-attention mechanism obtains the minimal Laplacian regularization ($p=2$)
and therefore encourages smoothness in the architecture. However, this
smoothness is not suited to the heterophilic structure of self-attention in
transformers, where attention weights assigned to nearby tokens and to distant
ones become indistinguishable. From that insight, we then
propose a novel class of transformers, namely the $p$-Laplacian Transformer
(p-LaT), which leverages the $p$-Laplacian regularization framework to harness the
heterophilic features within self-attention layers. In particular, low $p$
values will effectively assign higher attention weights to tokens that are in
close proximity to the current token being processed. We empirically
demonstrate the advantages of p-LaT over the baseline transformers on a wide
range of benchmark datasets.
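To make the connection above concrete, the following NumPy sketch illustrates one possible reading of it. The function name, the exact re-weighting rule, and the `eps` stabilizer are our own illustrative assumptions, not the authors' p-LaT layer: with $p=2$ the extra factor vanishes and the layer reduces to the usual softmax-attention average of the value features, while $p<2$ boosts the weights of tokens whose features are close to the current token, which is the behavior the abstract attributes to p-LaT.
```python
# Illustrative sketch only: a p-Laplacian-style re-weighting of softmax attention.
# The update rule, eps stabilizer, and names are assumptions, not the paper's code.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def p_laplacian_attention(X, Wq, Wk, Wv, p=2.0, eps=1e-2):
    """One smoothing step on the value features with p-Laplacian-style weights.

    X: (n, d) token features; Wq, Wk, Wv: (d, d) projections.
    p = 2 reduces exactly to standard softmax attention.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))          # base affinities w_ij
    # Factor ||v_i - v_j||^(p-2) from the energy sum_ij w_ij ||v_i - v_j||^p;
    # eps keeps it bounded when features coincide (e.g. i == j) and p < 2.
    sq_dist = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    gamma = (sq_dist + eps) ** ((p - 2.0) / 2.0)         # identically 1 when p = 2
    W = A * gamma
    W = W / W.sum(axis=-1, keepdims=True)                # re-normalize each row
    return W @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
Wq, Wk, Wv = rng.normal(size=(3, 16, 16))
out_standard = p_laplacian_attention(X, Wq, Wk, Wv, p=2.0)  # plain attention
out_sharper = p_laplacian_attention(X, Wq, Wk, Wv, p=1.2)   # locality-biased
print(out_standard.shape, out_sharper.shape)                # (8, 16) (8, 16)
```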
Related papers
- Pretrained transformer efficiently learns low-dimensional target functions in-context [40.77319247558742]
We show that a nonlinear transformer optimized by gradient descent learns $f_*$ in-context with a prompt length that only depends on the dimension of the distribution of target functions $r$.
Our result highlights the adaptivity of the pretrained transformer to low-dimensional structures of the function class, which enables sample-efficient ICL.
arXiv Detail & Related papers (2024-11-04T19:24:39Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for
Ultra-High Resolution Segmentation [18.50799240622156]
Proposes the Guided Patch-Grouping Wavelet Transformer (GPWFormer).
$\mathcal{T}$ takes the whole UHR image as input and harvests both local details and fine-grained long-range contextual dependencies.
$\mathcal{C}$ takes the downsampled image as input for learning the category-wise deep context.
arXiv Detail & Related papers (2023-07-03T02:19:48Z) - Parameterization of Cross-Token Relations with Relative Positional
Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Hybrid Model-based / Data-driven Graph Transform for Image Coding [54.31406300524195]
We present a hybrid model-based / data-driven approach to encode an intra-prediction residual block.
The first $K$ eigenvectors of a transform matrix are derived from a statistical model, e.g., the asymmetric discrete sine transform (ADST) for stability.
Using WebP as a baseline image codec, experimental results show that our hybrid graph transform achieves better energy compaction than the default discrete cosine transform (DCT) and better stability than the Karhunen-Loève transform (KLT).
arXiv Detail & Related papers (2022-03-02T15:36:44Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of
Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections (see the sparsity-mask sketch below).
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.