Choose a Transformer: Fourier or Galerkin
- URL: http://arxiv.org/abs/2105.14995v2
- Date: Thu, 3 Jun 2021 16:06:10 GMT
- Title: Choose a Transformer: Fourier or Galerkin
- Authors: Shuhao Cao
- Abstract summary: We apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need to a data-driven operator learning problem.
We show that softmax normalization in the scaled dot-product attention is sufficient but not necessary, and prove the approximation capacity of a linear variant as a Petrov-Galerkin projection.
We present three operator learning experiments, including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we apply the self-attention from the state-of-the-art
Transformer in Attention Is All You Need for the first time to a data-driven
operator learning problem related to partial differential equations. We put
together an effort to explain the heuristics of, and to improve the efficacy of,
the self-attention by demonstrating that the softmax normalization in the
scaled dot-product attention is sufficient but not necessary, and by proving
the approximation capacity of a linear variant as a Petrov-Galerkin projection.
A new layer normalization scheme is proposed to allow a scaling to propagate
through attention layers, which helps the model achieve remarkable accuracy in
operator learning tasks with unnormalized data. Finally, we present three
operator learning experiments, including the viscid Burgers' equation, an
interface Darcy flow, and an inverse interface coefficient identification
problem. All experiments validate the improvements of the newly proposed simple
attention-based operator learner over its softmax-normalized counterparts.
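As a concrete illustration of the softmax-free attention described in the abstract, the sketch below implements a Galerkin-type linear attention: keys and values are layer-normalized and contracted with each other before meeting the query, so the cost is linear in the sequence length. The single-head layout, the placement of the layer norms, and the tensor shapes are simplifying assumptions made for exposition, not a faithful reproduction of the paper's implementation.

```python
import torch
import torch.nn as nn

class GalerkinAttention(nn.Module):
    """Softmax-free, Galerkin-type attention (illustrative sketch).

    Instead of softmax(Q K^T / sqrt(d)) V, keys and values are layer-normalized
    and contracted first, giving Q (K^T V) / n, which is linear in the sequence
    length n. Shapes and normalization placement are assumptions of this sketch.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Layer norm on K and V so a scaling can propagate through the layer.
        self.norm_k = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model), where n is the number of grid points.
        n = x.size(1)
        q = self.q_proj(x)
        k = self.norm_k(self.k_proj(x))
        v = self.norm_v(self.v_proj(x))
        # Contract over the sequence dimension first: a (d, d) matrix per batch.
        kv = torch.einsum("bnd,bne->bde", k, v) / n
        return torch.einsum("bnd,bde->bne", q, kv)


# Example: a batch of 4 functions sampled on 256 grid points with 64 channels.
attn = GalerkinAttention(d_model=64)
out = attn(torch.randn(4, 256, 64))
print(out.shape)  # torch.Size([4, 256, 64])
```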
Related papers
- Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Understanding Optimal Feature Transfer via a Fine-Grained Bias-Variance Analysis [10.79615566320291]
We explore transfer learning with the goal of optimizing downstream performance.
We introduce a simple linear model that takes as input arbitrary pretrained features.
We identify the optimal pretrained representation by minimizing the downstream risk averaged over an ensemble of downstream tasks.
arXiv Detail & Related papers (2024-04-18T19:33:55Z)
- Invertible Fourier Neural Operators for Tackling Both Forward and Inverse Problems [18.48295539583625]
We propose an invertible Fourier Neural Operator (iFNO) that tackles both the forward and inverse problems.
We integrate a variational auto-encoder to capture the intrinsic structures within the input space and to enable posterior inference.
Evaluations on five benchmark problems demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-02-18T22:16:43Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- Physics-guided Data Augmentation for Learning the Solution Operator of Linear Differential Equations [2.1850269949775663]
We propose a physics-guided data augmentation (PGDA) method to improve the accuracy and generalization of neural operator models.
We demonstrate the advantage of PGDA on a variety of linear differential equations, showing that PGDA can improve the sample complexity and is robust to distributional shift.
arXiv Detail & Related papers (2022-12-08T06:29:15Z)
- Learning Operators with Coupled Attention [9.715465024071333]
We propose a novel operator learning method, LOCA, motivated by the recent success of the attention mechanism.
In our architecture the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations.
By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions.
arXiv Detail & Related papers (2022-01-04T08:22:03Z)
- Factorized Fourier Neural Operators [77.47313102926017]
The Factorized Fourier Neural Operator (F-FNO) is a learning-based method for simulating partial differential equations.
We show that our model maintains an error rate of 2% while still running an order of magnitude faster than a numerical solver.
arXiv Detail & Related papers (2021-11-27T03:34:13Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
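The Rectified Linear Attention entry above lends itself to a short illustration: the sketch below simply replaces the softmax in scaled dot-product attention with a ReLU, which zeroes out negative scores and yields sparse attention weights. The function layout is an assumption for exposition, and the renormalization the ReLA paper uses to stabilize training is deliberately omitted.

```python
import torch
import torch.nn.functional as F

def rectified_linear_attention(q, k, v):
    """Attention with ReLU in place of softmax (a simplified ReLA-style sketch).

    q, k, v: (batch, n, d). Negative scores are cut to exact zeros, giving
    sparse, unnormalized attention weights; the stabilizing renormalization
    used in the ReLA paper is omitted here (an assumption of this sketch).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, n, n)
    weights = F.relu(scores)                     # sparse attention weights
    return weights @ v

q = k = v = torch.randn(2, 8, 16)
out = rectified_linear_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 16])
```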
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.