Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems
- URL: http://arxiv.org/abs/2410.16546v2
- Date: Fri, 31 Oct 2025 01:48:02 GMT
- Title: Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems
- Authors: Usman Akram, Haris Vikalo,
- Abstract summary: We show that transformers can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems.
- Score: 18.634960596074027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input-output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.
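To make the comparison in the abstract concrete, below is a minimal, self-contained sketch (not the authors' code) of the linear-Gaussian setting: it simulates a state-space system, computes the Kalman filter's one-step output predictions that serve as the Bayesian baseline, and stacks the same input-output history into a context array of the kind a frozen transformer could be prompted with. The context layout (rows of [u_t, y_t] pairs plus an optional flattened parameter vector) and all variable names are illustrative assumptions, not the paper's exact tokenization.

```python
# Minimal sketch (assumed setup, not the paper's code): linear-Gaussian system,
# Kalman-filter output prediction baseline, and an in-context layout of the
# same trajectory for a hypothetical frozen transformer.
import numpy as np

rng = np.random.default_rng(0)

# System: x_{t+1} = A x_t + B u_t + w_t,   y_t = C x_t + v_t
n_x, n_u, n_y, T = 2, 1, 1, 50
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(n_x)   # process-noise covariance
R = 0.1 * np.eye(n_y)    # measurement-noise covariance

# Simulate a trajectory driven by random inputs.
x = np.zeros(n_x)
us, ys = [], []
for _ in range(T):
    u = rng.normal(size=n_u)
    y = C @ x + rng.multivariate_normal(np.zeros(n_y), R)
    us.append(u)
    ys.append(y)
    x = A @ x + B @ u + rng.multivariate_normal(np.zeros(n_x), Q)
us, ys = np.array(us), np.array(ys)

# Kalman filter: one-step output prediction E[y_t | y_{0:t-1}, u_{0:t-1}].
x_hat, P = np.zeros(n_x), np.eye(n_x)
y_pred_kf = []
for t in range(T):
    y_pred_kf.append(C @ x_hat)                      # predicted output at time t
    S = C @ P @ C.T + R                              # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)                   # Kalman gain
    x_hat = x_hat + K @ (ys[t] - C @ x_hat)          # measurement update
    P = (np.eye(n_x) - K @ C) @ P
    x_hat = A @ x_hat + B @ us[t]                    # time update
    P = A @ P @ A.T + Q
y_pred_kf = np.array(y_pred_kf)

# Assumed in-context layout: one row per time step, [u_t, y_t]; the frozen
# transformer would predict y_T from this context, optionally prepended with
# the flattened system parameters (A, B, C).
context = np.concatenate([us, ys], axis=1)           # shape (T, n_u + n_y)
params = np.concatenate([A.ravel(), B.ravel(), C.ravel()])
print("KF mean squared one-step prediction error:",
      float(np.mean((y_pred_kf - ys) ** 2)))
print("context shape:", context.shape,
      "| optional parameter vector length:", len(params))
```

Under the paper's claim, a transformer fed such a context produces output predictions whose error closely tracks the Kalman filter's in this linear-Gaussian regime, and approaches EKF and particle-filter performance when the simulated dynamics are nonlinear.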
Related papers
- Robust Unscented Kalman Filtering via Recurrent Meta-Adaptation of Sigma-Point Weights [0.0]
This work introduces the Meta-Adaptive UKF (MA-UKF), a framework that reformulates sigma-point weighting as a hyperparameter optimization problem. Unlike standard adaptive filters that rely on instantaneous corrections, our approach employs a Recurrent Context to compress the history of measurement innovations into a compact latent embedding. Numerical benchmarks on maneuvering targets demonstrate that the MA-UKF significantly outperforms standard baselines.
arXiv Detail & Related papers (2026-03-04T18:27:59Z) - Conditional Normalizing Flows for Forward and Backward Joint State and Parameter Estimation [0.0]
This study reviews recent normalizing-flow approaches to state estimation via nonlinear filtering. We investigate the performance of these approaches on applications relevant to autonomous driving and patient population dynamics. Finally, we assess the performance of various conditioning strategies for an application to real-world COVID-19 joint SIR system forecasting and estimation.
arXiv Detail & Related papers (2026-01-11T18:01:42Z) - Disordered Dynamics in High Dimensions: Connections to Random Matrices and Machine Learning [52.26396748560348]
We provide an overview of high-dimensional dynamical systems driven by random matrices. We focus on applications to simple models of learning and generalization in machine learning theory.
arXiv Detail & Related papers (2026-01-03T00:12:32Z) - Assumed Density Filtering and Smoothing with Neural Network Surrogate Models [0.0]
We show that cross entropy is a more appropriate performance metric than RMSE for evaluating the accuracy of filters and smoothers. We demonstrate the superiority of our method for state estimation on a Lorenz system and a Wiener system, and find that our method enables more optimal linear regulation when the state estimate is used for feedback.
arXiv Detail & Related papers (2025-11-12T06:08:53Z) - Efficient Transformed Gaussian Process State-Space Models for Non-Stationary High-Dimensional Dynamical Systems [49.819436680336786]
We propose an efficient transformed Gaussian process state-space model (ETGPSSM) for scalable and flexible modeling of high-dimensional, non-stationary dynamical systems. Specifically, our ETGPSSM integrates a single shared GP with input-dependent normalizing flows, yielding an expressive implicit process prior that captures complex, non-stationary transition dynamics. Our ETGPSSM outperforms existing GPSSMs and neural network-based SSMs in terms of computational efficiency and accuracy.
arXiv Detail & Related papers (2025-03-24T03:19:45Z) - Flow-based Bayesian filtering for high-dimensional nonlinear stochastic dynamical systems [4.382988524355736]
We propose a flow-based Bayesian filter (FBF) that integrates normalizing flows to construct a novel latent linear state-space model with Gaussian filtering distributions. This framework facilitates efficient density estimation and sampling using invertible transformations provided by normalizing flows. Numerical experiments demonstrate the superior accuracy and efficiency of FBF.
arXiv Detail & Related papers (2025-02-22T14:04:23Z) - (How) Can Transformers Predict Pseudo-Random Numbers? [7.201095605457193]
We study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs).
Our analysis reveals that Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a,c$).
arXiv Detail & Related papers (2025-02-14T18:59:40Z) - Exact Sequence Classification with Hardmax Transformers [0.0]
We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$.
Specifically, given $N$ sequences with an arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters perfectly classifying this dataset.
arXiv Detail & Related papers (2025-02-04T12:31:00Z) - Provable In-context Learning for Mixture of Linear Regressions using Transformers [34.458004744956334]
We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models.
For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order $\tilde{\mathcal{O}}((d/n)^{1/4})$ in the low signal-to-noise ratio (SNR) regime and $\tilde{\mathcal{O}}(\sqrt{d/n})$ in the high SNR regime.
arXiv Detail & Related papers (2024-10-18T05:28:47Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Higher-Order Transformer Derivative Estimates for Explicit Pathwise Learning Guarantees [9.305677878388664]
This paper fills a gap in the literature by precisely estimating the derivatives of all orders for the transformer model.
We obtain fully-explicit estimates of all constants in terms of the number of attention heads, the depth and width of each transformer block, and the number of normalization layers.
We conclude that real-world transformers can learn from $N$ samples of a single Markov process's trajectory at a rate of $O(\operatorname{polylog}(N)/\sqrt{N})$.
arXiv Detail & Related papers (2024-05-26T13:19:32Z) - How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator [15.01642959193149]
We investigate the simple harmonic oscillator (SHO), one of the most fundamental systems in physics.
We identify the methods transformers use to model the SHO, and to do so we hypothesize and evaluate possible methods by analyzing the encoding of these methods' intermediates.
Our analysis framework can conveniently extend to high-dimensional linear systems and nonlinear systems, which we hope will help reveal the "world model" hidden in transformers.
arXiv Detail & Related papers (2024-05-23T01:14:22Z) - Closed-form Filtering for Non-linear Systems [83.91296397912218]
We propose a new class of filters based on Gaussian PSD Models, which offer several advantages in terms of density approximation and computational efficiency.
We show that filtering can be efficiently performed in closed form when transitions and observations are Gaussian PSD Models.
Our proposed estimator enjoys strong theoretical guarantees, with estimation error that depends on the quality of the approximation and is adaptive to the regularity of the transition probabilities.
arXiv Detail & Related papers (2024-02-15T08:51:49Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT).
arXiv Detail & Related papers (2023-11-06T16:25:56Z) - From Spectral Theorem to Statistical Independence with Application to System Identification [11.98319841778396]
We provide the first quantitative handle on the decay rate of finite powers of the state transition matrix, $\|A^k\|$.
It is shown that when a stable dynamical system has only one distinct eigenvalue and a discrepancy of $n-1$, $\|A\|$ depends on $n$ and the resulting dynamics are inseparable.
We show that the element-wise error is essentially a variant of the well-known Littlewood-Offord problem.
arXiv Detail & Related papers (2023-10-16T15:40:43Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - System Identification for Continuous-time Linear Dynamical Systems [0.7510165488300368]
Generalizing the learning of latent linear dynamical systems to continuous time may extend the use of the hybrid Kalman filter.
We apply the method by learning the parameters of a latent, multivariate Fokker-Planck SDE representing a toggle-switch genetic circuit.
arXiv Detail & Related papers (2023-08-23T05:53:13Z) - Kalman Filter for Online Classification of Non-Stationary Data [101.26838049872651]
In Online Continual Learning (OCL) a learning system receives a stream of data and sequentially performs prediction and training steps.
We introduce a probabilistic Bayesian online learning model by using a neural representation and a state space model over the linear predictor weights.
In experiments in multi-class classification we demonstrate the predictive ability of the model and its flexibility to capture non-stationarity.
arXiv Detail & Related papers (2023-06-14T11:41:42Z) - Sampled Transformer for Point Sets [80.66097006145999]
The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions.
We propose an $O(n)$ complexity sampled transformer that can process point set elements directly without any additional inductive bias.
arXiv Detail & Related papers (2023-02-28T06:38:05Z) - Computational Doob's h-transforms for Online Filtering of Discretely Observed Diffusions [65.74069050283998]
We propose a computational framework to approximate Doob's $h$-transforms.
The proposed approach can be orders of magnitude more efficient than state-of-the-art particle filters.
arXiv Detail & Related papers (2022-06-07T15:03:05Z) - Stochastically forced ensemble dynamic mode decomposition for forecasting and analysis of near-periodic systems [65.44033635330604]
We introduce a novel load forecasting method in which observed dynamics are modeled as a forced linear system.
We show that its use of intrinsic linear dynamics offers a number of desirable properties in terms of interpretability and parsimony.
Results are presented for a test case using load data from an electrical grid.
arXiv Detail & Related papers (2020-10-08T20:25:52Z) - A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian Kernel, a Precise Phase Transition, and the Corresponding Double Descent [85.77233010209368]
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression in the realistic setting where the number of data samples $n$, their dimension $p$, and the feature-space dimension $N$ are all large and comparable.
This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$.
arXiv Detail & Related papers (2020-06-09T02:05:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.