Dynamical Mean-Field Theory of Self-Attention Neural Networks
- URL: http://arxiv.org/abs/2406.07247v1
- Date: Tue, 11 Jun 2024 13:29:34 GMT
- Title: Dynamical Mean-Field Theory of Self-Attention Neural Networks
- Authors: Ángel Poc-López, Miguel Aguilera
- Abstract summary: Transformer-based models have demonstrated exceptional performance across diverse domains.
Little is known about how they operate or what their expected dynamics are.
We use methods for the study of asymmetric Hopfield networks in nonequilibrium regimes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models have demonstrated exceptional performance across diverse domains, becoming the state-of-the-art solution for addressing sequential machine learning problems. Even though we have a general understanding of the fundamental components in the transformer architecture, little is known about how they operate or what their expected dynamics are. Recently, there has been increasing interest in exploring the relationship between attention mechanisms and Hopfield networks, promising to shed light on the statistical physics of transformer networks. However, to date, the dynamical regimes of transformer-like models have not been studied in depth. In this paper, we address this gap using methods for the study of asymmetric Hopfield networks in nonequilibrium regimes, namely path integral methods over generating functionals, yielding dynamics governed by concurrent mean-field variables. Assuming 1-bit tokens and weights, we derive analytical approximations for the behavior of large self-attention neural networks coupled to a softmax output, which become exact in the limit of large network size. Our findings reveal nontrivial dynamical phenomena, including nonequilibrium phase transitions associated with chaotic bifurcations, even for very simple configurations with a few encoded features and a very short context window. Finally, we discuss the potential of our analytic approach to improve our understanding of the inner workings of transformer models, potentially reducing computational training costs and enhancing model interpretability.
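The setup described in the abstract (binary tokens and weights, softmax attention over a short context window, and mean-field order parameters) can be illustrated with a small simulation. The sketch below is a hypothetical toy model, not the paper's derivation: the Hopfield-style patterns, the Glauber-style stochastic update, and the parameters `BETA`, `T_CTX`, and `P` are illustrative assumptions, chosen only to show the kind of binary self-attention dynamics whose population averages a dynamical mean-field treatment would describe.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200      # number of +/-1 units per state ("1-bit" features)
P = 3        # number of encoded feature patterns (illustrative)
T_CTX = 4    # context window length (illustrative)
BETA = 2.0   # inverse temperature controlling update noise (illustrative)
STEPS = 50   # simulation length

# Hopfield-style +/-1 patterns used as stored features; purely illustrative.
patterns = rng.choice([-1.0, 1.0], size=(P, N))

def attention_field(context, query):
    """Softmax attention over the context window: scores from overlaps
    between the current state (query) and each past state in the context;
    values are the context states themselves."""
    scores = context @ query / np.sqrt(N)        # shape (T_CTX,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context                     # shape (N,), attended field

# Initialise the context with noisy copies of one stored pattern.
state = np.where(patterns[0] + 0.5 * rng.standard_normal(N) > 0, 1.0, -1.0)
context = np.tile(state, (T_CTX, 1))

m_trace = []  # overlaps with stored patterns, the mean-field-like variables
for t in range(STEPS):
    field = attention_field(context, state)
    # Glauber-style stochastic update of each +/-1 unit.
    p_up = 1.0 / (1.0 + np.exp(-2.0 * BETA * field))
    state = np.where(rng.random(N) < p_up, 1.0, -1.0)
    # Slide the context window forward by one step.
    context = np.vstack([context[1:], state])
    m_trace.append(patterns @ state / N)

m_trace = np.array(m_trace)
print("final overlaps with stored patterns:", np.round(m_trace[-1], 3))
```

In such a toy model, sweeping `BETA` or the context length already produces qualitatively different regimes (frozen, noisy, or oscillatory overlaps), which is the type of behaviour the paper characterizes analytically through its mean-field equations; the authors' exact update rule and weight construction may differ.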
Related papers
- Identification of Mean-Field Dynamics using Transformers [3.8916312075738273]
This paper investigates the use of transformer architectures to approximate the mean-field dynamics of particle systems exhibiting collective behavior.
Specifically, we prove that if a finite-dimensional transformer can effectively approximate the finite-dimensional vector field governing the particle system, then the expected output of this transformer provides a good approximation for the infinite-dimensional mean-field vector field.
arXiv Detail & Related papers (2024-10-06T19:47:24Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Differential Evolution Algorithm based Hyper-Parameters Selection of Transformer Neural Network Model for Load Forecasting [0.0]
Transformer models have the potential to improve load forecasting because of their ability to learn long-range dependencies through their attention mechanism.
Our work compares the proposed Transformer-based neural network model, integrated with different metaheuristic algorithms, by its load forecasting performance on numerical metrics such as Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE).
arXiv Detail & Related papers (2023-07-28T04:29:53Z) - Learning minimal representations of stochastic processes with variational autoencoders [52.99137594502433]
We introduce an unsupervised machine learning approach to determine the minimal set of parameters required to describe a process.
Our approach enables the autonomous discovery of unknown parameters describing such processes.
arXiv Detail & Related papers (2023-07-21T14:25:06Z) - Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which then can be applied in real-time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z) - Convexifying Transformers: Improving optimization and understanding of transformer networks [56.69983975369641]
We study the training problem of attention/transformer networks and introduce a novel convex analytic approach.
We first introduce a convex alternative to the self-attention mechanism and reformulate the regularized training problem of transformer networks.
As a byproduct of our convex analysis, we reveal an implicit regularization mechanism, which promotes sparsity across tokens.
arXiv Detail & Related papers (2022-11-20T18:17:47Z) - Learning Interacting Dynamical Systems with Latent Gaussian Process ODEs [13.436770170612295]
We study for the first time uncertainty-aware modeling of continuous-time dynamics of interacting objects.
Our model infers both independent dynamics and their interactions with reliable uncertainty estimates.
arXiv Detail & Related papers (2022-05-24T08:36:25Z) - Equivariant vector field network for many-body system modeling [65.22203086172019]
The Equivariant Vector Field Network (EVFN) is built on a novel equivariant basis and the associated scalarization and vectorization layers.
We evaluate our method on predicting trajectories of simulated Newton mechanics systems with both full and partially observed data.
arXiv Detail & Related papers (2021-10-26T14:26:25Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the role of stochasticity in its success is still unclear.
We show that multiplicative noise, as it commonly arises due to variance, results in heavy-tailed behaviour in the parameters.
A detailed analysis shows that key factors, including step size and properties of the data, all exhibit similar behaviour on state-of-the-art neural network models; a minimal simulation of this multiplicative-noise mechanism is sketched after this list.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Deep learning of contagion dynamics on complex networks [0.0]
We propose a complementary approach based on deep learning to build effective models of contagion dynamics on networks.
By allowing simulations on arbitrary network structures, our approach makes it possible to explore the properties of the learned dynamics beyond the training data.
Our results demonstrate how deep learning offers a new and complementary perspective to build effective models of contagion dynamics on networks.
arXiv Detail & Related papers (2020-06-09T17:18:34Z)
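As referenced in the "Multiplicative noise and heavy tails in stochastic optimization" entry above, the mechanism it summarizes can be reproduced with a generic Kesten-type recursion. The snippet below is a standard textbook illustration under assumed lognormal noise parameters, not that paper's analysis of stochastic optimizers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Kesten-type recursion x_{t+1} = a_t * x_t + b_t. When E[log a_t] < 0 but
# a_t occasionally exceeds 1 (multiplicative noise), the stationary law of
# x_t develops power-law tails even though a_t and b_t are light-tailed.
STEPS, SAMPLES = 500, 50_000
x = np.zeros(SAMPLES)
for _ in range(STEPS):
    a = rng.lognormal(mean=-0.1, sigma=0.3, size=SAMPLES)  # multiplicative noise
    b = rng.normal(0.0, 0.1, size=SAMPLES)                 # additive noise
    x = a * x + b

# Excess kurtosis is ~0 for a Gaussian; it is large and positive here,
# signalling heavy tails in the stationary parameter distribution.
z = (x - x.mean()) / x.std()
print("excess kurtosis:", float(np.mean(z**4) - 3.0))
```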