Choose a Transformer: Fourier or Galerkin
- URL: http://arxiv.org/abs/2105.14995v2
- Date: Thu, 3 Jun 2021 16:06:10 GMT
- Title: Choose a Transformer: Fourier or Galerkin
- Authors: Shuhao Cao
- Abstract summary: We apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need to a data-driven operator learning problem.
We show that softmax normalization in the scaled dot-product attention is sufficient but not necessary, and prove the approximation capacity of a linear variant as a Petrov-Galerkin projection.
We present three operator learning experiments, including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we apply the self-attention from the state-of-the-art
Transformer in Attention Is All You Need for the first time to a data-driven
operator learning problem related to partial differential equations. We put
together an effort to explain the heuristics of, and to improve the efficacy of,
the self-attention by demonstrating that the softmax normalization in the
scaled dot-product attention is sufficient but not necessary, and by proving
the approximation capacity of a linear variant as a Petrov-Galerkin projection.
A new layer normalization scheme is proposed to allow a scaling to propagate
through attention layers, which helps the model achieve remarkable accuracy in
operator learning tasks with unnormalized data. Finally, we present three
operator learning experiments, including the viscid Burgers' equation, an
interface Darcy flow, and an inverse interface coefficient identification
problem. All experiments validate the improvements of the newly proposed simple
attention-based operator learner over its softmax-normalized counterparts.
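As a concrete illustration of the softmax-free attention described in the abstract, the sketch below implements a Galerkin-type linear attention: keys and values are layer-normalized and contracted with each other before meeting the query, so the cost is linear in the sequence length. The single-head layout, the placement of the layer norms, and the tensor shapes are simplifying assumptions made for exposition, not a faithful reproduction of the paper's implementation.

```python
import torch
import torch.nn as nn

class GalerkinAttention(nn.Module):
    """Softmax-free, Galerkin-type attention (illustrative sketch).

    Instead of softmax(Q K^T / sqrt(d)) V, keys and values are layer-normalized
    and contracted first, giving Q (K^T V) / n, which is linear in the sequence
    length n. Shapes and normalization placement are assumptions of this sketch.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Layer norm on K and V so a scaling can propagate through the layer.
        self.norm_k = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model), where n is the number of grid points.
        n = x.size(1)
        q = self.q_proj(x)
        k = self.norm_k(self.k_proj(x))
        v = self.norm_v(self.v_proj(x))
        # Contract over the sequence dimension first: a (d, d) matrix per batch.
        kv = torch.einsum("bnd,bne->bde", k, v) / n
        return torch.einsum("bnd,bde->bne", q, kv)


# Example: a batch of 4 functions sampled on 256 grid points with 64 channels.
attn = GalerkinAttention(d_model=64)
out = attn(torch.randn(4, 256, 64))
print(out.shape)  # torch.Size([4, 256, 64])
```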
Related papers
- Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Understanding Optimal Feature Transfer via a Fine-Grained Bias-Variance Analysis [10.79615566320291]
We explore transfer learning with the goal of optimizing downstream performance.
We introduce a simple linear model that takes as input arbitrary pretrained features.
We identify the optimal pretrained representation by minimizing the downstream risk averaged over an ensemble of downstream tasks.
arXiv Detail & Related papers (2024-04-18T19:33:55Z)
- Invertible Fourier Neural Operators for Tackling Both Forward and Inverse Problems [18.48295539583625]
We propose an invertible Fourier Neural Operator (iFNO) that tackles both the forward and inverse problems.
We integrate a variational auto-encoder to capture the intrinsic structures within the input space and to enable posterior inference.
Evaluations on five benchmark problems demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-02-18T22:16:43Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- Physics-guided Data Augmentation for Learning the Solution Operator of Linear Differential Equations [2.1850269949775663]
We propose a physics-guided data augmentation (PGDA) method to improve the accuracy and generalization of neural operator models.
We demonstrate the advantage of PGDA on a variety of linear differential equations, showing that PGDA can improve the sample complexity and is robust to distributional shift.
arXiv Detail & Related papers (2022-12-08T06:29:15Z)
- Learning Operators with Coupled Attention [9.715465024071333]
We propose a novel operator learning method, LOCA, motivated by the recent success of the attention mechanism.
In our architecture the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations.
By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions.
arXiv Detail & Related papers (2022-01-04T08:22:03Z)
- Factorized Fourier Neural Operators [77.47313102926017]
The Factorized Fourier Neural Operator (F-FNO) is a learning-based method for simulating partial differential equations.
We show that our model maintains an error rate of 2% while still running an order of magnitude faster than a numerical solver.
arXiv Detail & Related papers (2021-11-27T03:34:13Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
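The Rectified Linear Attention entry above lends itself to a short illustration: the sketch below simply replaces the softmax in scaled dot-product attention with a ReLU, which zeroes out negative scores and yields sparse attention weights. The function layout is an assumption for exposition, and the renormalization the ReLA paper uses to stabilize training is deliberately omitted.

```python
import torch
import torch.nn.functional as F

def rectified_linear_attention(q, k, v):
    """Attention with ReLU in place of softmax (a simplified ReLA-style sketch).

    q, k, v: (batch, n, d). Negative scores are cut to exact zeros, giving
    sparse, unnormalized attention weights; the stabilizing renormalization
    used in the ReLA paper is omitted here (an assumption of this sketch).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, n, n)
    weights = F.relu(scores)                     # sparse attention weights
    return weights @ v

q = k = v = torch.randn(2, 8, 16)
out = rectified_linear_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16])
```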
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.