Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution
- URL: http://arxiv.org/abs/2512.23068v1
- Date: Sun, 28 Dec 2025 20:27:58 GMT
- Title: Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution
- Authors: Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao
- Abstract summary: Phase Gradient Flow (PGF) is a framework that computes exact analytical derivatives by operating directly in the state-space manifold. Our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.
- Score: 3.551701030393209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Selective State Space Models (SSMs) achieve linear-time inference, yet their gradient-based sensitivity analysis remains bottlenecked by O(L) memory scaling during backpropagation. This memory constraint precludes genomic-scale modeling (L > 10^5) on consumer-grade hardware. We introduce Phase Gradient Flow (PGF), a framework that computes exact analytical derivatives by operating directly in the state-space manifold, bypassing the need to materialize the intermediate computational graph. By reframing SSM dynamics as Tiled Operator-Space Evolution (TOSE), our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Unlike parallel prefix scans that exhibit numerical divergence in stiff ODE regimes, PGF ensures stability through invariant error scaling, maintaining near-machine precision across extreme sequences. We demonstrate the utility of PGF on an impulse-response benchmark with 128,000-step sequences - a scale where conventional Autograd encounters prohibitive memory overhead, often leading to out-of-memory (OOM) failures in multi-layered models. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.
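The abstract's O(1)-memory claim can be illustrated on a toy case: for a linear recurrence, exact derivatives follow by evolving sensitivity states forward alongside the hidden state, tile by tile, with no stored computational graph. The sketch below is our own scalar illustration of that idea; the function name, the sum-of-squares loss, and the tiling granularity are assumptions, not the paper's TOSE implementation.

```python
def pgf_grad(a, b, c, xs, targets, tile=64):
    """Exact gradients for a scalar linear SSM h_t = a*h_{t-1} + b*x_t,
    y_t = c*h_t, under a sum-of-squares loss. Sensitivity states dh/da and
    dh/db are evolved forward alongside h, so no per-step activations are
    stored: memory is O(1) in sequence length, and the loop runs in tiles."""
    h = dh_da = dh_db = 0.0
    loss = g_a = g_b = g_c = 0.0
    L = len(xs)
    for start in range(0, L, tile):              # tiled evolution
        for t in range(start, min(start + tile, L)):
            dh_da = h + a * dh_da                # d h_t / d a (uses h_{t-1})
            dh_db = xs[t] + a * dh_db            # d h_t / d b
            h = a * h + b * xs[t]
            e = c * h - targets[t]               # residual of y_t
            loss += e * e
            g_a += 2.0 * e * c * dh_da
            g_b += 2.0 * e * c * dh_db
            g_c += 2.0 * e * h
    return loss, (g_a, g_b, g_c)
```

Only the triple (h, dh_da, dh_db) crosses tile boundaries, so peak memory is independent of sequence length, and the returned gradients agree closely with finite-difference checks.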
Related papers
- From Complex Dynamics to DynFormer: Rethinking Transformers for PDEs [6.873342825786888]
Transformer-based neural operators have emerged as powerful data-driven alternatives for solving PDEs. We propose DynFormer, a novel dynamics-informed neural operator. We show that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-03T15:45:09Z)
- Exact Discrete Stochastic Simulation with Deep-Learning-Scale Gradient Optimization [0.0]
Exact simulation of continuous-time Markov chains (CTMCs) is essential when discreteness and noise drive system behavior, but the hard categorical event selection in Gillespie-type algorithms blocks gradient-based learning. We eliminate this constraint by decoupling forward simulation from backward differentiation, with hard categorical sampling generating exact trajectories and gradients propagating through a continuous, massively parallel Gumbel-Softmax straight-through surrogate. Our results enable high-dimensional parameter inference and inverse design across systems biology, chemical kinetics, physics, and related CTMC-governed domains.
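The forward/backward decoupling named in this summary can be sketched with a straight-through Gumbel-Softmax: the forward pass commits to a hard (exact) categorical event, while gradients are defined through the relaxed softmax sample. This is a generic sketch of the named estimator, not the paper's massively parallel implementation; the function name is ours.

```python
import math, random

def st_gumbel_softmax(logits, tau=1.0, rng=random):
    """Straight-through Gumbel-Softmax sketch. Forward: a hard one-hot
    sample from Gumbel-perturbed logits, so the simulated trajectory stays
    exact. Backward: gradients flow as if the relaxed softmax sample had
    been used, via the returned Jacobian-vector product closure."""
    g = [-math.log(-math.log(rng.random())) for _ in logits]   # Gumbel noise
    y = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    z = sum(exps)
    soft = [e / z for e in exps]                               # relaxed sample
    k = max(range(len(y)), key=y.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(y))]     # exact event
    def grad_wrt_logits(upstream):
        # JVP of softmax((logits + g) / tau) with respect to logits.
        dot = sum(u * s for u, s in zip(upstream, soft))
        return [s * (u - dot) / tau for s, u in zip(soft, upstream)]
    return hard, soft, grad_wrt_logits
```

Because softmax gradients sum to zero against a constant upstream signal, the surrogate leaves probability mass conserved while still steering the logits of the sampled event.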
arXiv Detail & Related papers (2026-02-23T12:29:43Z)
- Unifying Learning Dynamics and Generalization in Transformers Scaling Law [1.5229257192293202]
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data.
arXiv Detail & Related papers (2025-12-26T17:20:09Z)
- Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression [53.48692193399171]
Gated KalmaNet (GKA) is a layer that reduces the gap to full attention by accounting for the full past when predicting the next token. We solve an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. On long-context benchmarks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading-memory baselines.
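The constant-memory recipe named in this summary can be sketched as recursive accumulation of ridge-regression sufficient statistics: only a d×d Gram matrix and a d-vector are carried, regardless of sequence length. The sketch below omits GKA's gating entirely, and the function names are our assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for cc in range(col, n + 1):
                M[r][cc] -= f * M[col][cc]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][cc] * x[cc] for cc in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def online_ridge(keys, values, lam=1e-2):
    """Online ridge regression with memory independent of sequence length:
    stream (key, value) pairs, carrying only A = lam*I + sum(k k^T) and
    b = sum(v * k), then solve A w = b for the read-out weights."""
    d = len(keys[0])
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for k, v in zip(keys, values):
        for i in range(d):
            b[i] += v * k[i]
            for j in range(d):
                A[i][j] += k[i] * k[j]
    return solve(A, b)
```

The per-token cost is O(d^2) and the state never grows with context, which is the sense in which test-time regression sidesteps the memory wall of storing the full key/value history.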
arXiv Detail & Related papers (2025-11-26T03:26:37Z)
- The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long-sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z)
- More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems [2.5971517743176915]
We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), which integrates the Two-Scale Effective Dimension (2SED) with FOSGD. By tracking sensitivity and effective dimensionality, 2SEDFOSGD dynamically modulates the fractional exponent to mitigate sluggish oscillations and hasten convergence.
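The fractional-exponent idea can be made concrete with one common discretization. This is an illustrative Grünwald-Letnikov-style sketch only; the 2SED-based exponent adaptation that defines 2SEDFOSGD is not reproduced, and the function name is ours.

```python
def fractional_sgd_step(w, grad_hist, lr=0.1, alpha=0.9, K=8):
    """One fractional-order gradient step using a truncated
    Grünwald-Letnikov discretization: the last K gradients (newest first)
    are mixed with fractional binomial weights (-1)^k * C(alpha, k),
    giving the power-law memory behind FOSGD. alpha = 0 recovers SGD."""
    c, step = 1.0, 0.0
    for k, g in enumerate(grad_hist[:K]):
        if k > 0:
            c *= 1.0 - (alpha + 1.0) / k     # binomial-weight recursion
        step += c * g
    return w - lr * step
```

An adaptive scheme in the spirit of the summary would re-estimate `alpha` from sensitivity statistics between steps; here it is held fixed for clarity.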
arXiv Detail & Related papers (2025-05-05T19:27:36Z)
- Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems [2.5971517743176915]
We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD) to adapt the fractional exponent in a data-driven manner. Theoretically, this approach preserves the advantages of fractional memory without the sluggish or unstable behavior observed in naïve fractional SGD.
arXiv Detail & Related papers (2025-03-17T22:57:37Z)
- TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training [91.8932638236073]
We introduce TensorGRaD, a novel method that directly addresses the memory challenges associated with large structured weights. We show that sparseGRaD reduces total memory usage by over 50% while maintaining, and sometimes even improving, accuracy.
arXiv Detail & Related papers (2025-01-04T20:51:51Z)
- Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs [93.82811501035569]
We introduce a new data-efficient and highly parallelizable operator learning approach with reduced memory requirements and better generalization.
MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena.
We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression.
arXiv Detail & Related papers (2023-09-29T20:18:52Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Memory-Efficient Differentiable Programming for Quantum Optimal Control of Discrete Lattices [1.5012666537539614]
Quantum optimal control problems are typically solved by gradient-based algorithms such as GRAPE.
Our study of quantum optimal control (QOC) reveals that memory requirements are a barrier to simulating large models or long time spans.
We employ a nonstandard differentiable programming approach that significantly reduces the memory requirements at the cost of a reasonable amount of recomputation.
arXiv Detail & Related papers (2022-10-15T20:59:23Z)
- NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z)
- Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation [68.25872110275542]
We propose an efficient inference procedure for non-autoregressive machine translation.
It iteratively refines translation purely in the continuous space.
We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En.
arXiv Detail & Related papers (2020-09-15T15:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.