Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures
- URL: http://arxiv.org/abs/2505.00818v1
- Date: Thu, 01 May 2025 19:19:29 GMT
- Title: Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures
- Authors: Heng-Sheng Chang, Prashant G. Mehta
- Abstract summary: We present a framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture.
- Score: 1.9567015559455132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a mathematical framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture, in which a finite sequence of observations (tokens) is mapped to the conditional probability of the next token. Our objective is not to construct a mathematical model of a transformer. Rather, our interest lies in deriving, from first principles, transformer-like architectures that solve the prediction problem for which the transformer is designed. The proposed framework is based on an original optimal control approach, where the prediction objective (MMSE) is reformulated as an optimal control problem. An analysis of the optimal control problem is presented, leading to a fixed-point equation on the space of probability measures. To solve the fixed-point equation, we introduce the dual filter, an iterative algorithm that closely parallels the architecture of decoder-only transformers. These parallels are discussed in detail, along with the relationship to prior work on mathematical modeling of transformers as transport on the space of probability measures. Numerical experiments are provided to illustrate the performance of the algorithm using parameter values from research-scale transformer models.
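For orientation, the prediction objective can be written in standard HMM notation (a minimal sketch using textbook conventions, which need not match the paper's exact formulation): given tokens Z_1, ..., Z_t generated by an HMM with hidden state X_t, the MMSE estimate of the indicator 1{Z_{t+1} = z} given the observed sequence is the conditional probability

    \hat{p}_t(z) = \mathbb{E}\bigl[\,\mathbf{1}\{Z_{t+1} = z\} \mid Z_1, \dots, Z_t\,\bigr] = \mathrm{P}(Z_{t+1} = z \mid Z_1, \dots, Z_t),

which is exactly the next-token distribution that a decoder-only transformer maps a finite token sequence to.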
Related papers
- A surrogate model for topology optimisation of elastic structures via parametric autoencoders [0.0]
Instead of learning the parametric solution of the state (and adjoint) problems, the proposed approach devises a surrogate version of the entire optimisation pipeline.
The method predicts a quasi-optimal topology for a given problem configuration as a surrogate model of high-fidelity topologies optimised with the homogenisation method.
Different architectures are proposed and the approximation and generalisation capabilities of the resulting models are numerically evaluated.
arXiv Detail & Related papers (2025-07-42T10:07:42Z)
- On the Existence of Universal Simulators of Attention [17.01811978811789]
We present solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP.
Our proofs, for the first time, show the existence of an algorithmically achievable, data-agnostic solution, previously known to be approximated only by learning.
arXiv Detail & Related papers (2025-06-23T15:15:25Z)
- Constrained belief updates explain geometric structures in transformer representations [0.0]
We integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models.
We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex and create representations with distinctive geometric structure (see the belief-update sketch at the end of this page).
arXiv Detail & Related papers (2025-02-04T03:03:54Z)
- Adaptive posterior distributions for uncertainty analysis of covariance matrices in Bayesian inversion problems for multioutput signals [0.0]
We address the problem of performing Bayesian inference for the parameters of a nonlinear multi-output model.
The variables of interest are split into two blocks, and the inference takes advantage of known analytical optimization formulas.
arXiv Detail & Related papers (2025-01-02T09:01:09Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
arXiv Detail & Related papers (2024-03-26T14:18:43Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly.
We treat the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm.
We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z)
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that addresses the shortcomings of these approaches by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z)
- EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation [104.44478403427881]
EdgeFormer is a parameter-efficient Transformer of the encoder-decoder architecture for on-device seq2seq generation.
We conduct experiments on two practical on-device seq2seq tasks: Machine Translation and Grammatical Error Correction.
arXiv Detail & Related papers (2022-02-16T10:10:00Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- Bayesian learning of orthogonal embeddings for multi-fidelity Gaussian Processes [3.564709604457361]
"Projection" mapping consists of an orthonormal matrix that is considered a priori unknown and needs to be inferred jointly with the GP parameters.
We extend the proposed framework to multi-fidelity models using GPs, including scenarios where multiple outputs are trained together.
The benefits of our proposed framework are illustrated on the computationally challenging three-dimensional aerodynamic optimization of a last-stage blade for an industrial gas turbine.
arXiv Detail & Related papers (2020-08-05T22:28:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.
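To make the HMM belief-update language concrete (referenced in the "Constrained belief updates" entry above and central to the hidden Markov setting of the main paper), the following is a minimal sketch of the classical Bayes filter for a finite-state HMM and the next-token distribution it induces. This is textbook material, not the dual filter or the algorithm of any paper listed here; the matrices A and B and the token sequence are illustrative placeholders.

    import numpy as np

    def belief_update(belief, A, B, token):
        """One Bayes-filter step on the probability simplex.

        belief : (n_states,) posterior over hidden states given tokens so far
        A      : (n_states, n_states) transition matrix, A[i, j] = P(next state j | state i)
        B      : (n_states, n_tokens) emission matrix, B[j, z] = P(token z | state j)
        token  : index of the newly observed token
        """
        predicted = belief @ A                    # propagate the belief through the dynamics
        unnormalized = predicted * B[:, token]    # condition on the observed token
        return unnormalized / unnormalized.sum()  # renormalize back onto the simplex

    def next_token_distribution(belief, A, B):
        """P(next token | tokens observed so far) implied by the current belief."""
        return (belief @ A) @ B

    # Toy example: 2 hidden states, 3 tokens, uniform prior over the initial state.
    A = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    B = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
    belief = np.array([0.5, 0.5])
    for z in [0, 0, 2]:                           # an observed token sequence
        belief = belief_update(belief, A, B, z)
    print(next_token_distribution(belief, A, B))  # conditional next-token probabilities

Each arriving token maps one point of the probability simplex to another; the dual filter of the main paper is a different, control-theoretic construction, and this sketch is included only to make the notion of a belief update on the simplex concrete.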