LLMs are Bayesian, in Expectation, not in Realization
- URL: http://arxiv.org/abs/2507.11768v1
- Date: Tue, 15 Jul 2025 22:20:11 GMT
- Title: LLMs are Bayesian, in Expectation, not in Realization
- Authors: Leon Chlon, Sarah Rashidi, Zein Khamis, MarcAntonio M. Awada
- Abstract summary: Large language models adapt to new tasks without parameter updates. Recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $\Theta(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = \Theta(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99\% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.
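The abstract's scaling results translate directly into back-of-the-envelope formulas. A minimal sketch, in which the constant `c` is a hypothetical placeholder for the paper's explicit (task-dependent) constants:

```python
import math

def optimal_cot_length(n: int, eps: float, c: float = 1.0) -> int:
    """Optimal chain-of-thought length, k* = Theta(sqrt(n) * log(1/eps)).

    `c` stands in for the paper's explicit constant, which is not
    reproduced here; it is a hypothetical placeholder.
    """
    return max(1, math.ceil(c * math.sqrt(n) * math.log(1.0 / eps)))

def martingale_violation_scale(n: int) -> float:
    """Order of the positional-encoding-induced martingale violation,
    Theta(log n / n): it vanishes as the number of in-context examples grows."""
    return math.log(n) / n

# With n = 100 in-context examples and tolerance eps = 1%:
print(optimal_cot_length(100, 0.01))   # 47 steps with c = 1
print(martingale_violation_scale(100)) # log(100)/100 ~ 0.046
```

Under these placeholder constants, halving the tolerance only adds a logarithmic number of chain-of-thought steps, while the martingale-violation scale shrinks roughly like $\log n / n$, which is why the violation is compatible with near-Bayesian behavior in expectation.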
Related papers
- Causal Direction from Convergence Time: Faster Training in the True Causal Direction [0.0]
We introduce Causal Computational Asymmetry (CCA), a principle for causal direction identification based on optimization dynamics. CCA operates in optimization-time space, distinguishing it from methods such as RESIT, IGCI, and SkewScore. We further embed CCA into a broader framework termed Causal Compression Learning (CCL), which integrates graph structure learning, causal information compression, and policy optimization.
arXiv Detail & Related papers (2026-02-24T21:34:57Z)
- Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
arXiv Detail & Related papers (2026-02-10T21:08:53Z)
- Variational Entropic Optimal Transport [67.76725267984578]
We propose Variational Entropic Optimal Transport (VarEOT) for domain translation problems. VarEOT is based on an exact variational reformulation of the log-partition function $\log \mathbb{E}[\exp(\cdot)]$ as a tractable maximization over an auxiliary positive normalizer. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality.
arXiv Detail & Related papers (2026-02-02T15:48:44Z)
- Parametric Mean-Field empirical Bayes in high-dimensional linear regression [8.197187859375694]
We characterize a sharp phase transition behavior for the variational Empirical Bayes (vEB) estimator. In the first regime, we show how the estimated prior can be calibrated to enable valid coordinate-wise and delocalized inference. In the second regime, we propose a debiasing technique as a way to improve the performance of the vEB estimator.
arXiv Detail & Related papers (2026-01-23T15:44:01Z)
- Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium [0.6820746164515952]
We introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium. We instantiate this principle as Equilibrium Transformers, which augment standard transformer layers with an Equilibrium Refinement Module. Preliminary experiments on the binary parity task demonstrate a +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance.
arXiv Detail & Related papers (2025-11-26T20:02:59Z)
- Pointwise confidence estimation in the non-linear $\ell^2$-regularized least squares [12.352761060862072]
We consider a high-probability non-asymptotic confidence estimation in the $\ell^2$-regularized non-linear least-squares setting with fixed design. We show a pointwise confidence bound, meaning that it holds for the prediction on any given fixed test input $x$.
arXiv Detail & Related papers (2025-06-08T11:23:49Z)
- Born a Transformer -- Always a Transformer? [57.37263095476691]
We study a family of $\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al. We observe an $\textit{induction-versus-anti-induction}$ asymmetry, where pretrained models are better at retrieving tokens to the right (induction) than to the left (anti-induction) of a query token. Mechanistic analysis reveals that this asymmetry is connected to the differences in the strength of induction versus anti-induction circuits within pretrained transformers.
arXiv Detail & Related papers (2025-05-27T21:36:50Z)
- Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z)
- Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling. Next-token prediction can be made robust so as to achieve $C=\tilde{O}(H)$, representing moderate error amplification. No computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z)
- Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification [7.869708570399577]
We consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$.
Theoretically, we show that the trained Transformer reaches near Bayes-optimality, suggesting that it exploits information about the training distribution.
arXiv Detail & Related papers (2024-05-24T00:08:55Z)
- PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates [17.777466668123886]
We introduce PROMISE ($\textbf{Pr}$econditioned $\textbf{O}$ptimization $\textbf{M}$ethods by $\textbf{I}$ncorporating $\textbf{S}$calable Curvature $\textbf{E}$stimates), a suite of sketching-based preconditioned gradient algorithms.
PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha.
arXiv Detail & Related papers (2023-09-05T07:49:10Z)
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
- Improved Convergence of Score-Based Diffusion Models via Prediction-Correction [15.772322871598085]
Score-based generative models (SGMs) are powerful tools to sample from complex data distributions.
This paper addresses the issue by considering a version of the popular predictor-corrector scheme.
We first estimate the final distribution via an inexact Langevin dynamics and then revert the process.
arXiv Detail & Related papers (2023-05-23T15:29:09Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Stability and Risk Bounds of Iterative Hard Thresholding [41.082982732100696]
We introduce a novel sparse generalization theory for IHT under the notion of algorithmic stability.
We show that IHT with sparsity level $k$ enjoys an $\tilde{\mathcal{O}}(n^{-1/2}\sqrt{\log(n)\log(p)})$ rate of convergence in sparse excess risk.
Preliminary numerical evidence is provided to confirm our theoretical predictions.
arXiv Detail & Related papers (2022-03-17T16:12:56Z)
- Robust Implicit Networks via Non-Euclidean Contractions [63.91638306025768]
Implicit neural networks offer improved accuracy and a significant reduction in memory consumption.
However, they can suffer from ill-posedness and convergence instability.
This paper provides a new framework to design well-posed and robust implicit neural networks.
arXiv Detail & Related papers (2021-06-06T18:05:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.