Related papers: Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning

Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning

URL: http://arxiv.org/abs/2503.01329v2
Date: Wed, 16 Apr 2025 09:54:20 GMT
Title: Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning
Authors: Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheongwoong Kang, Jaesik Choi,
Abstract summary: We introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs)<n>Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index.<n>Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets.
Score: 30.781578037476347
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) based on transformer architectures have sparked significant interest in understanding their inner workings. In this paper, we introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs). Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Through spectral analysis of the model's dynamics, we uncover an increase in eigenvalue magnitude that challenges the weight-sharing assumption prevalent in existing theoretical studies. We also leverage the Lyapunov exponent to examine token-level sensitivity, enhancing model interpretability. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets, while offering flexible fine-tuning capabilities that can adapt to different architectural constraints.

Related papers

Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding [5.2482659629416535]
We characterize when tokens move closer to or farther from one another over time, depending on the model parameters.<n>We investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes.<n>Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding.
arXiv Detail & Related papers (2025-11-25T19:39:57Z)
Physics-Informed Learning of Proprietary Inverter Models for Grid Dynamic Studies [0.5999777817331317]
We develop a physics-informed neural ordinary differential equations-based framework to emulate the proprietary dynamics of the inverters.<n>The proposed method is validated using a grid-forming inverter (GFM) case study.
arXiv Detail & Related papers (2025-07-21T05:48:31Z)
Langevin Flows for Modeling Neural Latent Dynamics [81.81271685018284]
We introduce LangevinFlow, a sequential Variational Auto-Encoder where the time evolution of latent variables is governed by the underdamped Langevin equation.<n>Our approach incorporates physical priors -- such as inertia, damping, a learned potential function, and forces -- to represent both autonomous and non-autonomous processes in neural systems.<n>Our method outperforms state-of-the-art baselines on synthetic neural populations generated by a Lorenz attractor.
arXiv Detail & Related papers (2025-07-15T17:57:48Z)
Learning High-dimensional Ionic Model Dynamics Using Fourier Neural Operators [0.0]
We investigate whether Fourier Neural Operators can learn the evolution of all the state variables within ionic systems in higher dimensions.<n>We demonstrate the effectiveness of this approach by accurately learning the dynamics of three well-established ionic models with increasing dimensionality.<n>Both constrained and unconstrained architectures achieve comparable results in terms of accuracy across all the models considered.
arXiv Detail & Related papers (2025-05-20T07:37:03Z)
Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
Generalized Factor Neural Network Model for High-dimensional Regression [50.554377879576066]
We tackle the challenges of modeling high-dimensional data sets with latent low-dimensional structures hidden within complex, non-linear, and noisy relationships.<n>Our approach enables a seamless integration of concepts from non-parametric regression, factor models, and neural networks for high-dimensional regression.
arXiv Detail & Related papers (2025-02-16T23:13:55Z)
OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization [1.7180235064112577]
We consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model.
arXiv Detail & Related papers (2025-01-30T22:52:40Z)
The Asymptotic Behavior of Attention in Transformers [2.631744051718347]
The transformer architecture has become the foundation of modern Large Language Models (LLMs)<n>We show that all the tokens in a transformerally converge to a cluster as depth increases.<n>We then extend our analysis to autoregressive models, exploiting their structure to further generalize the theoretical guarantees.
arXiv Detail & Related papers (2024-12-03T18:54:49Z)
Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces novel deep dynamical models designed to represent continuous-time sequences.<n>We train the model using maximum likelihood estimation with Markov chain Monte Carlo.<n> Experimental results on oscillating systems, videos and real-world state sequences (MuJoCo) demonstrate that our model with the learnable energy-based prior outperforms existing counterparts.
arXiv Detail & Related papers (2024-09-05T18:14:22Z)
Modularity in Transformers: Investigating Neuron Separability & Specialization [0.0]
Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited. This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models. Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets.
arXiv Detail & Related papers (2024-08-30T14:35:01Z)
A Unified Framework for Interpretable Transformers Using PDEs and Information Theory [3.4039202831583903]
This paper presents a novel unified theoretical framework for understanding Transformer architectures by integrating Partial Differential Equations (PDEs), Neural Information Flow Theory, and Information Bottleneck Theory. We model Transformer information dynamics as a continuous PDE process, encompassing diffusion, self-attention, and nonlinear residual components. Our comprehensive experiments across image and text modalities demonstrate that the PDE model effectively captures key aspects of Transformer behavior, achieving high similarity (cosine similarity > 0.98) with Transformer attention distributions across all layers.
arXiv Detail & Related papers (2024-08-18T16:16:57Z)
Dynamical Mean-Field Theory of Self-Attention Neural Networks [0.0]
Transformer-based models have demonstrated exceptional performance across diverse domains. Little is known about how they operate or what are their expected dynamics. We use methods for the study of asymmetric Hopfield networks in nonequilibrium regimes.
arXiv Detail & Related papers (2024-06-11T13:29:34Z)
Differential Evolution Algorithm based Hyper-Parameters Selection of Transformer Neural Network Model for Load Forecasting [0.0]
Transformer models have the potential to improve Load forecasting because of their ability to learn long-range dependencies derived from their Attention Mechanism. Our work compares the proposed Transformer based Neural Network model integrated with different metaheuristic algorithms by their performance in Load forecasting based on numerical metrics such as Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE)
arXiv Detail & Related papers (2023-07-28T04:29:53Z)
Neural Operator with Regularity Structure for Modeling Dynamics Driven by SPDEs [70.51212431290611]
Partial differential equations (SPDEs) are significant tools for modeling dynamics in many areas including atmospheric sciences and physics. We propose the Neural Operator with Regularity Structure (NORS) which incorporates the feature vectors for modeling dynamics driven by SPDEs. We conduct experiments on various of SPDEs including the dynamic Phi41 model and the 2d Navier-Stokes equation.
arXiv Detail & Related papers (2022-04-13T08:53:41Z)
Equivariant vector field network for many-body system modeling [65.22203086172019]
Equivariant Vector Field Network (EVFN) is built on a novel equivariant basis and the associated scalarization and vectorization layers. We evaluate our method on predicting trajectories of simulated Newton mechanics systems with both full and partially observed data.
arXiv Detail & Related papers (2021-10-26T14:26:25Z)
Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models. We propose a VAE architecture in which a part of the latent space is grounded by physics. We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.