Related papers: Inductive Biases and Variable Creation in Self-Attention Mechanisms

Inductive Biases and Variable Creation in Self-Attention Mechanisms

URL: http://arxiv.org/abs/2110.10090v1
Date: Tue, 19 Oct 2021 16:36:19 GMT
Title: Inductive Biases and Variable Creation in Self-Attention Mechanisms
Authors: Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang
Abstract summary: This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables.
Score: 25.79946667926312
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, where our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length. Furthermore, we propose new experimental protocols to support this analysis and to guide the practice of training Transformers, built around the large body of work on provably learning sparse Boolean functions.

Related papers

Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters.<n>Recent research has actively explored the training dynamics behind ICL.<n>This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Least Squares regression.<n>We derive upper bounds on the empirical error which, in the regime, decay at a provable rate of $mathcalO(t-1/2d)$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z)
A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z)
Latent Variable Multi-output Gaussian Processes for Hierarchical Datasets [0.8057006406834466]
Multi-output Gaussian processes (MOGPs) have been introduced to deal with multiple tasks by exploiting the correlations between different outputs. This paper proposes an extension of MOGPs for hierarchical datasets.
arXiv Detail & Related papers (2023-08-31T15:52:35Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data. Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test. We train a variational inference model to predict the causal structure from observational/interventional data. Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs [13.436770170612295]
We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects. Our model infers both independent dynamics and their interactions with reliable uncertainty estimates.
arXiv Detail & Related papers (2022-05-24T08:36:25Z)
Scalable Gaussian Processes for Data-Driven Design using Big Data with Categorical Factors [14.337297795182181]
Gaussian processes (GP) have difficulties in accommodating big datasets, categorical inputs, and multiple responses. We propose a GP model that utilizes latent variables and functions obtained through variational inference to address the aforementioned challenges simultaneously. Our approach is demonstrated for machine learning of ternary oxide materials and topology optimization of a multiscale compliant mechanism.
arXiv Detail & Related papers (2021-06-26T02:17:23Z)
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR) Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.