Related papers: Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Transformers Simulate MLE for Sequence Generation in Bayesian Networks

URL: http://arxiv.org/abs/2501.02547v1
Date: Sun, 05 Jan 2025 13:56:51 GMT
Title: Transformers Simulate MLE for Sequence Generation in Bayesian Networks
Authors: Yuan Cao, Yihan He, Dennis Wu, Hong-Yu Chen, Jianqing Fan, Han Liu,
Abstract summary: We investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE)<n>We demonstrate that there exists a simple transformer model that can estimate the conditional probabilities of the Bayesian network according to the context.<n>We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training.
Score: 18.869174453242383
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Transformers have achieved significant success in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite these achievements, the theoretical understanding of transformers' capabilities remains limited. In this paper, we investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE). Specifically, we consider a setting where a context is formed by a set of independent sequences generated according to a Bayesian network. We demonstrate that there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis highlights the potential of transformers to learn complex probabilistic models and contributes to a better understanding of large language models as a powerful class of sequence generators.

Related papers

On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions [8.192218166714422]
Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood.<n>We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity.<n>As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints.
arXiv Detail & Related papers (2026-03-03T15:27:15Z)
Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice [1.8725832935669624]
Pseudo-random number generators (PRNGs) are high-nonlinear processes, and they are key blocks in optimization of Large language models.<n>We show that it is reasonable to generate high-quality pseudo-random numbers based on transformers.
arXiv Detail & Related papers (2025-08-02T01:31:53Z)
Universal Approximation Theorem for a Single-Layer Transformer [0.0]
Deep learning employs multi-layer neural networks trained via the backpropagation algorithm.<n>Transformers have achieved state-of-the-art performance in natural language processing.<n>We prove that a single-layer Transformer, comprising one self-attention layer followed by a position-wise feed-forward network with ReLU activation, can any continuous sequence-to-sequence mapping on a compact domain to arbitrary precision.
arXiv Detail & Related papers (2025-07-11T11:37:39Z)
On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks. We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z)
Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot [50.16171384920963]
transformer architecture has prevailed in various deep learning settings. One-layer transformer trained with gradient descent provably learns the sparse token selection task.
arXiv Detail & Related papers (2024-06-11T02:15:53Z)
Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks as transformers have demonstrated a remarkable ability to exhibit inattention learning (ICL) We show that when transformer training over random instances of linear regression problems, these models' predictions mimic nonlinear of ordinary squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z)
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL. We show that transformers can implement a broad class of standard machine learning algorithms in context. A emphsingle transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation. We propose a novel scalable Transformers, which naturally contains sub-Transformers of different scales and have shared parameters. A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
Variational Transformers for Diverse Response Generation [71.53159402053392]
Variational Transformer (VT) is a variational self-attentive feed-forward sequence model. VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE. We explore two types of VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of finegrained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.