Related papers: An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

URL: http://arxiv.org/abs/2212.14852v3
Date: Mon, 1 Apr 2024 03:51:06 GMT
Title: An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
Authors: Yufeng Zhang, Boyi Liu, Qi Cai, Lingxiao Wang, Zhaoran Wang,
Abstract summary: We show that input tokens are often exchangeable since they already include positional encodings. We establish the existence of a sufficient and minimal representation of input tokens. We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
Score: 64.87562101662952
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks. - To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences. - To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.

Related papers

Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences [29.509249228044492]
We show that randomly untrained models display extreme token preferences across random input sequences.<n>We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction.
arXiv Detail & Related papers (2026-02-05T17:37:41Z)
Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models [9.660348625678001]
Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions.<n>We introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory.
arXiv Detail & Related papers (2025-10-14T04:31:49Z)
Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters.<n>Recent research has actively explored the training dynamics behind ICL.<n>This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization.<n>We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z)
Interpretable representation learning of quantum data enabled by probabilistic variational autoencoders [0.5999777817331317]
variational autoencoders (VAEs) have shown promise in extracting the hidden physical features of some input data.<n>VAEs must account for its intrinsic randomness and complex correlations when dealing with quantum data.<n>Here, we demonstrate that two key modifications enable VAEs to learn physically meaningful latent representations.
arXiv Detail & Related papers (2025-06-13T17:39:41Z)
Sequence Complementor: Complementing Transformers For Time Series Forecasting with Learnable Sequences [5.244482076690776]
We find that expressive capability of sequence representation is a key factor influencing Transformer performance in time forecasting. We propose a novel attention mechanism with Sequence Complementors and prove feasible from an information theory perspective.
arXiv Detail & Related papers (2025-01-06T03:08:39Z)
Identifiable Representation and Model Learning for Latent Dynamic Systems [0.0]
We study the problem of identifiable representation and model learning for latent dynamic systems. We prove that, for linear or affine nonlinear latent dynamic systems, it is possible to identify the representations up to scaling and determine the models up to some simple transformations.
arXiv Detail & Related papers (2024-10-23T13:55:42Z)
Are queries and keys always relevant? A case study on Transformer wave functions [0.0]
dot product attention mechanism, originally designed for natural language processing (NLP) tasks, is a cornerstone of modern Transformers. We explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions.
arXiv Detail & Related papers (2024-05-29T08:32:37Z)
A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks [29.764014766305174]
We show how a transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We establish that this entire procedure is implementable using the transformer mechanism.
arXiv Detail & Related papers (2023-05-26T15:49:43Z)
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations. We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
Causally-motivated Shortcut Removal Using Auxiliary Labels [63.686580185674195]
Key challenge to learning such risk-invariant predictors is shortcut learning. We propose a flexible, causally-motivated approach to address this challenge. We show both theoretically and empirically that this causally-motivated regularization scheme yields robust predictors.
arXiv Detail & Related papers (2021-05-13T16:58:45Z)
Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
Tractable Inference in Credal Sentential Decision Diagrams [116.6516175350871]
Probabilistic sentential decision diagrams are logic circuits where the inputs of disjunctive gates are annotated by probability values. We develop the credal sentential decision diagrams, a generalisation of their probabilistic counterpart that allows for replacing the local probabilities with credal sets of mass functions. For a first empirical validation, we consider a simple application based on noisy seven-segment display images.
arXiv Detail & Related papers (2020-08-19T16:04:34Z)
Learning Disentangled Representations with Latent Variation Predictability [102.4163768995288]
This paper defines the variation predictability of latent disentangled representations. Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and corresponding image pairs. We develop an evaluation metric that does not rely on the ground-truth generative factors to measure the disentanglement of latent representations.
arXiv Detail & Related papers (2020-07-25T08:54:26Z)
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR) Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.