Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
- URL: http://arxiv.org/abs/2505.20666v1
- Date: Tue, 27 May 2025 03:30:10 GMT
- Title: Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
- Authors: Yukun Zhang, Xueqing Zhou
- Abstract summary: We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism. We show that PDE-based attention leads to better optimization landscapes and enhances gradient flow. Our findings highlight the potential of PDE-based formulations to enrich attention mechanisms with continuous-time dynamics and global coherence.
- Score: 3.2266392324513267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics. This mechanism systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE-based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we benchmark our method on diverse experiments, demonstrating consistent gains over both standard and specialized long-sequence Transformer variants. Our findings highlight the potential of PDE-based formulations to enrich attention mechanisms with continuous-time dynamics and global coherence.
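The abstract describes attention weights evolving under diffusion dynamics over a pseudo-time dimension. A minimal NumPy sketch of that idea, assuming an explicit-Euler heat equation along the key axis with hypothetical `tau` (pseudo-time step) and `steps` parameters; the paper's actual operator and boundary handling may differ:

```python
import numpy as np

def diffuse_attention(attn, tau=0.1, steps=5):
    """Evolve a row-stochastic attention matrix under a discrete heat
    equation along the key axis, then renormalize each row.

    Illustrative sketch, not the paper's exact mechanism.
    """
    A = attn.copy()
    for _ in range(steps):
        # Discrete Laplacian along the key dimension (zero-flux boundaries).
        padded = np.pad(A, ((0, 0), (1, 1)), mode="edge")
        lap = padded[:, :-2] - 2 * A + padded[:, 2:]
        A = A + tau * lap
        A = np.clip(A, 0, None)
        A = A / A.sum(axis=1, keepdims=True)  # keep rows stochastic
    return A

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
smoothed = diffuse_attention(attn, tau=0.2, steps=10)
print(smoothed.sum(axis=1))  # rows still sum to 1
```

Each diffusion step averages neighboring attention values, which is the "systematic smoothing of local noise" the abstract refers to; `tau <= 0.5` keeps the explicit scheme stable and nonnegative.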
Related papers
- KoopGen: Koopman Generator Networks for Representing and Predicting Dynamical Systems with Continuous Spectra [65.11254608352982]
We introduce a generator-based neural Koopman framework that models dynamics through a structured, state-dependent representation of Koopman generators. By exploiting the intrinsic Cartesian decomposition into skew-adjoint and self-adjoint components, KoopGen separates conservative transport from irreversible dissipation.
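The Cartesian decomposition KoopGen exploits is the standard split of a real matrix into symmetric and antisymmetric parts; a quick sketch of that split and why the skew part is "conservative" (generic linear algebra, not the paper's code):

```python
import numpy as np

# Any real square matrix A splits uniquely into a self-adjoint (symmetric)
# part S and a skew-adjoint (antisymmetric) part K, A = S + K. In a
# Koopman-generator reading, K drives norm-preserving (conservative)
# transport while S accounts for growth or dissipation.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))

S = 0.5 * (A + A.T)  # self-adjoint part: S == S.T
K = 0.5 * (A - A.T)  # skew-adjoint part: K == -K.T

# The skew part does no work on the state norm: for dx/dt = A x,
# d/dt ||x||^2 = 2 x.T A x = 2 x.T S x, because x.T K x == 0.
x = rng.normal(size=4)
print(float(x @ K @ x))  # 0 up to round-off
```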
arXiv Detail & Related papers (2026-02-15T06:32:23Z) - Parallel Complex Diffusion for Scalable Time Series Generation [50.01609741902786]
PaCoDi is a spectral-native architecture that decouples generative modeling in the frequency domain. We show that PaCoDi outperforms existing baselines in both generation quality and inference speed.
arXiv Detail & Related papers (2026-02-10T14:31:53Z) - SEED: Spectral Entropy-Guided Evaluation of SpatialTemporal Dependencies for Multivariate Time Series Forecasting [8.507253633170947]
We develop a Spectral Entropy-guided Evaluation framework for spatial-temporal Dependency modeling. SEED provides a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. SEED achieves state-of-the-art performance, validating its effectiveness and generality.
arXiv Detail & Related papers (2025-12-09T06:18:05Z) - RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion [64.49056527678606]
We propose a Token-wise Attention mechanism integrated into both the U-Net diffusion model and the radar-temporal encoder. Unlike prior approaches, our method integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion. Our experiments and evaluations demonstrate that the proposed method significantly outperforms state-of-the-art approaches in robustness, local fidelity, and generalization in complex precipitation forecasting scenarios.
arXiv Detail & Related papers (2025-10-16T17:59:13Z) - PDE-Transformer: A Continuous Dynamical Systems Approach to Sequence Modeling [4.1812935375151925]
We propose PDE-Transformer, a sequence modeling paradigm that casts the forward pass of a Transformer as the numerical discretization of a continuous reaction-diffusion system. In our framework, token embeddings evolve under a partial differential equation whose nonlocal integral term models self-attention. We design an Adaptive PDE Diffusion Layer that enforces local smoothness in feature space with linear time complexity.
arXiv Detail & Related papers (2025-09-27T08:58:47Z) - Attention as an Adaptive Filter [0.0]
We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. By assuming a continuous-time linear time-invariant system, we can make use of a closed-form solution of the differential Lyapunov equation to efficiently propagate uncertainties through the dynamics from keys to queries. A generalization of attention naturally arises as the maximum likelihood solution for filtering the trajectory of this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated query-key precisions.
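The closed-form Lyapunov machinery AFA relies on can be illustrated with the steady-state case: for a stable LTI system, the covariance P solves A P + P Aᵀ + Q = 0, which Kronecker vectorization turns into an ordinary linear solve. This is generic linear algebra, not the paper's implementation (which handles the time-dependent differential Lyapunov equation):

```python
import numpy as np

def steady_state_covariance(A, Q):
    """Solve the algebraic Lyapunov equation A P + P A^T + Q = 0 for a
    stable A via Kronecker vectorization (column-stacking vec).

    Illustrative sketch of closed-form uncertainty propagation.
    """
    n = A.shape[0]
    I = np.eye(n)
    # vec(A P + P A^T) = (I kron A + A kron I) vec(P)
    M = np.kron(I, A) + np.kron(A, I)
    vecP = np.linalg.solve(M, -Q.reshape(-1, order="F"))
    return vecP.reshape(n, n, order="F")

A = np.array([[-1.0, 0.5], [0.0, -2.0]])  # stable: eigenvalues -1, -2
Q = np.eye(2)  # process-noise covariance
P = steady_state_covariance(A, Q)
residual = A @ P + P @ A.T + Q
print(np.abs(residual).max())  # ~0
```

Stability of A guarantees the Kronecker sum is invertible (no pair of eigenvalues sums to zero), so the solve is well-posed.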
arXiv Detail & Related papers (2025-09-04T12:29:14Z) - ENMA: Tokenwise Autoregression for Generative Neural PDE Operators [12.314585849869797]
We introduce ENMA, a generative neural operator designed to model dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with a flow matching loss. The framework generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.
arXiv Detail & Related papers (2025-06-06T15:25:14Z) - StFT: Spatio-temporal Fourier Transformer for Long-term Dynamics Prediction [10.64762092324374]
We propose an autoregressive Spatio-temporal Fourier Transformer (StFT), in which each block learns the system dynamics at a distinct scale. StFT captures the underlying dynamics across both macro- and micro- spatial scales. Evaluations conducted on three benchmark datasets demonstrate the advantages of our approach over state-of-the-art ML methods.
arXiv Detail & Related papers (2025-03-14T22:04:03Z) - A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer. We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z) - Sequence Complementor: Complementing Transformers For Time Series Forecasting with Learnable Sequences [5.244482076690776]
We find that the expressive capability of sequence representation is a key factor influencing Transformer performance in time series forecasting. We propose a novel attention mechanism with Sequence Complementors and prove its feasibility from an information-theoretic perspective.
arXiv Detail & Related papers (2025-01-06T03:08:39Z) - Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints [51.83081671798784]
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. DiT's practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference. We propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) - the key efficiency component in U-Nets.
arXiv Detail & Related papers (2024-11-26T17:28:10Z) - Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces novel deep dynamical models designed to represent continuous-time sequences. We train the model using maximum likelihood estimation with Markov chain Monte Carlo. Experimental results on oscillating systems, videos and real-world state sequences (MuJoCo) demonstrate that our model with the learnable energy-based prior outperforms existing counterparts.
arXiv Detail & Related papers (2024-09-05T18:14:22Z) - A Poisson-Gamma Dynamic Factor Model with Time-Varying Transition Dynamics [51.147876395589925]
A non-stationary PGDS is proposed to allow the underlying transition matrices to evolve over time.
A fully-conjugate and efficient Gibbs sampler is developed to perform posterior simulation.
Experiments show that, in comparison with related models, the proposed non-stationary PGDS achieves improved predictive performance.
arXiv Detail & Related papers (2024-02-26T04:39:01Z) - Attractor Memory for Long-Term Time Series Forecasting: A Chaos Perspective [63.60312929416228]
Attraos incorporates chaos theory into long-term time series forecasting.
We show that Attraos outperforms various LTSF methods on mainstream datasets and chaotic datasets with only one-twelfth of the parameters compared to PatchTST.
arXiv Detail & Related papers (2024-02-18T05:35:01Z) - EgPDE-Net: Building Continuous Neural Networks for Time Series Prediction with Exogenous Variables [22.145726318053526]
Inter-series correlation and time dependence among variables are rarely considered in existing continuous-time methods.
We propose a continuous-time model for arbitrary-step prediction to learn an unknown PDE system.
arXiv Detail & Related papers (2022-08-03T08:34:31Z) - Learning to Accelerate Partial Differential Equations via Latent Global Evolution [64.72624347511498]
Latent Evolution of PDEs (LE-PDE) is a simple, fast and scalable method to accelerate the simulation and inverse optimization of PDEs.
We introduce new learning objectives to effectively learn such latent dynamics to ensure long-term stability.
We demonstrate up to 128x reduction in the dimensions to update, and up to 15x improvement in speed, while achieving competitive accuracy.
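LE-PDE's reported speedup comes from evolving a low-dimensional latent state instead of the full grid. A toy encode-evolve-decode sketch, with a PCA basis and a linear contraction standing in for LE-PDE's learned encoder and latent evolution networks (all components here are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)

# Snapshots of a "full" 256-dimensional state that actually lives in an
# 8-dimensional subspace, mimicking a low-rank solution manifold.
modes = rng.normal(size=(256, 8))
snapshots = modes @ rng.normal(size=(8, 100))  # shape (256, 100)

# Encoder/decoder: top-8 SVD (PCA) directions of the snapshot matrix.
U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
basis = U[:, :8]

def encode(x):
    return basis.T @ x  # 256 -> 8

def decode(z):
    return basis @ z    # 8 -> 256

# Latent "dynamics": a simple contraction in place of a learned network.
A_lat = 0.9 * np.eye(8)

z = encode(snapshots[:, 0])
for _ in range(10):     # each step updates 8 numbers, not 256
    z = A_lat @ z
x_pred = decode(z)
print(x_pred.shape)
```

Here each time step touches 8 latent dimensions instead of 256 grid values, which is the same dimension-reduction mechanism behind the paper's reported update-size and speed gains.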
arXiv Detail & Related papers (2022-06-15T17:31:24Z) - Dynamics of Ultracold Bosons in Artificial Gauge Fields: Angular Momentum, Fragmentation, and the Variance of Entropy [0.0]
We consider the dynamics of two-dimensional interacting ultracold bosons triggered by suddenly switching on an artificial gauge field.
We analyze the emergent dynamics by monitoring the angular momentum, the fragmentation, as well as the entropy and the variance of the entropy of absorption or single-shot images.
arXiv Detail & Related papers (2020-12-17T19:00:03Z) - Stochastically forced ensemble dynamic mode decomposition for forecasting and analysis of near-periodic systems [65.44033635330604]
We introduce a novel load forecasting method in which observed dynamics are modeled as a forced linear system.
We show that its use of intrinsic linear dynamics offers a number of desirable properties in terms of interpretability and parsimony.
Results are presented for a test case using load data from an electrical grid.
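The forced-linear-system view above builds on plain dynamic mode decomposition: fit a one-step linear operator from snapshot pairs by least squares. A minimal sketch of that building block (the paper's forcing term and ensembling are not shown):

```python
import numpy as np

# Minimal DMD: given snapshots x_0, x_1, ..., stack them as columns and
# fit the operator mapping each snapshot to the next, A = X2 X1^+.
rng = np.random.default_rng(3)
A_true = np.array([[0.95, 0.10], [-0.10, 0.95]])  # slowly decaying rotation

x = rng.normal(size=2)
traj = [x]
for _ in range(50):
    x = A_true @ x
    traj.append(x)
traj = np.array(traj).T           # shape (2, 51)

X1, X2 = traj[:, :-1], traj[:, 1:]
A_dmd = X2 @ np.linalg.pinv(X1)   # least-squares one-step operator

print(np.abs(A_dmd - A_true).max())  # ~0 for noise-free linear data
```

The eigenvalues of the fitted operator give the "intrinsic linear dynamics" (growth rates and frequencies) that make such models interpretable and parsimonious.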
arXiv Detail & Related papers (2020-10-08T20:25:52Z) - Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives [97.16266088683061]
The article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms.
It provides a characterization of algorithms that exhibit accelerated convergence.
arXiv Detail & Related papers (2020-02-28T00:32:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.