Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution
- URL: http://arxiv.org/abs/2512.23068v1
- Date: Sun, 28 Dec 2025 20:27:58 GMT
- Title: Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution
- Authors: Shuhuan Wang, Yuzhen Xie, Jiayi Li, Yinliang Diao
- Abstract summary: Phase Gradient Flow (PGF) is a framework that computes exact analytical derivatives by operating directly in the state-space manifold. Our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.
- Score: 3.551701030393209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Selective State Space Models (SSMs) achieve linear-time inference, yet their gradient-based sensitivity analysis remains bottlenecked by O(L) memory scaling during backpropagation. This memory constraint precludes genomic-scale modeling (L > 10^5) on consumer-grade hardware. We introduce Phase Gradient Flow (PGF), a framework that computes exact analytical derivatives by operating directly in the state-space manifold, bypassing the need to materialize the intermediate computational graph. By reframing SSM dynamics as Tiled Operator-Space Evolution (TOSE), our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Unlike parallel prefix scans that exhibit numerical divergence in stiff ODE regimes, PGF ensures stability through invariant error scaling, maintaining near-machine precision across extreme sequences. We demonstrate the utility of PGF on an impulse-response benchmark with 128,000-step sequences - a scale where conventional Autograd encounters prohibitive memory overhead, often leading to out-of-memory (OOM) failures in multi-layered models. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.
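The abstract's O(1)-memory claim can be illustrated on a toy case: for a linear recurrence, exact derivatives follow by evolving sensitivity states forward alongside the hidden state, tile by tile, with no stored computational graph. The sketch below is our own scalar illustration of that idea; the function name, the sum-of-squares loss, and the tiling granularity are assumptions, not the paper's TOSE implementation.

```python
def pgf_grad(a, b, c, xs, targets, tile=64):
    """Exact gradients for a scalar linear SSM h_t = a*h_{t-1} + b*x_t,
    y_t = c*h_t, under a sum-of-squares loss. Sensitivity states dh/da and
    dh/db are evolved forward alongside h, so no per-step activations are
    stored: memory is O(1) in sequence length, and the loop runs in tiles."""
    h = dh_da = dh_db = 0.0
    loss = g_a = g_b = g_c = 0.0
    L = len(xs)
    for start in range(0, L, tile):              # tiled evolution
        for t in range(start, min(start + tile, L)):
            dh_da = h + a * dh_da                # d h_t / d a (uses h_{t-1})
            dh_db = xs[t] + a * dh_db            # d h_t / d b
            h = a * h + b * xs[t]
            e = c * h - targets[t]               # residual of y_t
            loss += e * e
            g_a += 2.0 * e * c * dh_da
            g_b += 2.0 * e * c * dh_db
            g_c += 2.0 * e * h
    return loss, (g_a, g_b, g_c)
```

Only the triple (h, dh_da, dh_db) crosses tile boundaries, so peak memory is independent of sequence length, and the returned gradients agree closely with finite-difference checks.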
Related papers
- From Complex Dynamics to DynFormer: Rethinking Transformers for PDEs [6.873342825786888]
Transformer-based neural operators have emerged as powerful data-driven alternatives for solving PDEs. We propose DynFormer, a novel dynamics-informed neural operator. We show that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-03T15:45:09Z)
- Exact Discrete Stochastic Simulation with Deep-Learning-Scale Gradient Optimization [0.0]
Exact simulation of continuous-time Markov chains (CTMCs) is essential when discreteness and noise drive system behavior, but the hard categorical event selection in Gillespie-type algorithms blocks gradient-based learning. We eliminate this constraint by decoupling forward simulation from backward differentiation, with hard categorical sampling generating exact trajectories and gradients propagating through a continuous, massively parallel Gumbel-Softmax straight-through surrogate. Our results enable high-dimensional parameter inference and inverse design across systems biology, chemical kinetics, physics, and related CTMC-governed domains.
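The forward/backward decoupling named in this summary can be sketched with a straight-through Gumbel-Softmax: the forward pass commits to a hard (exact) categorical event, while gradients are defined through the relaxed softmax sample. This is a generic sketch of the named estimator, not the paper's massively parallel implementation; the function name is ours.

```python
import math, random

def st_gumbel_softmax(logits, tau=1.0, rng=random):
    """Straight-through Gumbel-Softmax sketch. Forward: a hard one-hot
    sample from Gumbel-perturbed logits, so the simulated trajectory stays
    exact. Backward: gradients flow as if the relaxed softmax sample had
    been used, via the returned Jacobian-vector product closure."""
    g = [-math.log(-math.log(rng.random())) for _ in logits]   # Gumbel noise
    y = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    z = sum(exps)
    soft = [e / z for e in exps]                               # relaxed sample
    k = max(range(len(y)), key=y.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(y))]     # exact event
    def grad_wrt_logits(upstream):
        # JVP of softmax((logits + g) / tau) with respect to logits.
        dot = sum(u * s for u, s in zip(upstream, soft))
        return [s * (u - dot) / tau for s, u in zip(soft, upstream)]
    return hard, soft, grad_wrt_logits
```

Because softmax gradients sum to zero against a constant upstream signal, the surrogate leaves probability mass conserved while still steering the logits of the sampled event.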
arXiv Detail & Related papers (2026-02-23T12:29:43Z)
- Unifying Learning Dynamics and Generalization in Transformers Scaling Law [1.5229257192293202]
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data.
arXiv Detail & Related papers (2025-12-26T17:20:09Z)
- Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression [53.48692193399171]
Gated KalmaNet (GKA) is a layer that reduces the gap to full attention by accounting for the full past when predicting the next token. We solve an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. On long-context benchmarks, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than 10% relative improvement over other fading-memory baselines.
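The constant-memory recipe named in this summary can be sketched as recursive accumulation of ridge-regression sufficient statistics: only a d×d Gram matrix and a d-vector are carried, regardless of sequence length. The sketch below omits GKA's gating entirely, and the function names are our assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for cc in range(col, n + 1):
                M[r][cc] -= f * M[col][cc]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][cc] * x[cc] for cc in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def online_ridge(keys, values, lam=1e-2):
    """Online ridge regression with memory independent of sequence length:
    stream (key, value) pairs, carrying only A = lam*I + sum(k k^T) and
    b = sum(v * k), then solve A w = b for the read-out weights."""
    d = len(keys[0])
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for k, v in zip(keys, values):
        for i in range(d):
            b[i] += v * k[i]
            for j in range(d):
                A[i][j] += k[i] * k[j]
    return solve(A, b)
```

The per-token cost is O(d^2) and the state never grows with context, which is the sense in which test-time regression sidesteps the memory wall of storing the full key/value history.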
arXiv Detail & Related papers (2025-11-26T03:26:37Z)
- The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long-sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z)
- More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems [2.5971517743176915]
We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), which integrates the Two-Scale Effective Dimension (2SED) with FOSGD. By tracking sensitivity and effective dimensionality, 2SEDFOSGD dynamically modulates the fractional exponent to mitigate sluggish oscillations and hasten convergence.
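The fractional-exponent idea can be made concrete with one common discretization. This is an illustrative Grünwald-Letnikov-style sketch only; the 2SED-based exponent adaptation that defines 2SEDFOSGD is not reproduced, and the function name is ours.

```python
def fractional_sgd_step(w, grad_hist, lr=0.1, alpha=0.9, K=8):
    """One fractional-order gradient step using a truncated
    Grünwald-Letnikov discretization: the last K gradients (newest first)
    are mixed with fractional binomial weights (-1)^k * C(alpha, k),
    giving the power-law memory behind FOSGD. alpha = 0 recovers SGD."""
    c, step = 1.0, 0.0
    for k, g in enumerate(grad_hist[:K]):
        if k > 0:
            c *= 1.0 - (alpha + 1.0) / k     # binomial-weight recursion
        step += c * g
    return w - lr * step
```

An adaptive scheme in the spirit of the summary would re-estimate `alpha` from sensitivity statistics between steps; here it is held fixed for clarity.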
arXiv Detail & Related papers (2025-05-05T19:27:36Z)
- Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems [2.5971517743176915]
We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD) to adapt the fractional exponent in a data-driven manner. Theoretically, this approach preserves the advantages of fractional memory without the sluggish or unstable behavior observed in naïve fractional SGD.
arXiv Detail & Related papers (2025-03-17T22:57:37Z)
- TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training [91.8932638236073]
We introduce TensorGRaD, a novel method that directly addresses the memory challenges associated with large structured weights. We show that sparseGRaD reduces total memory usage by over 50% while maintaining, and sometimes even improving, accuracy.
arXiv Detail & Related papers (2025-01-04T20:51:51Z)
- Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs [93.82811501035569]
We introduce a new data-efficient and highly parallelizable operator learning approach with reduced memory requirements and better generalization.
MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena.
We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression.
arXiv Detail & Related papers (2023-09-29T20:18:52Z)
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- Memory-Efficient Differentiable Programming for Quantum Optimal Control of Discrete Lattices [1.5012666537539614]
Quantum optimal control problems are typically solved by gradient-based algorithms such as GRAPE.
Our study of quantum optimal control (QOC) reveals that memory requirements are a barrier to simulating large models or long time spans.
We employ a nonstandard differentiable programming approach that significantly reduces the memory requirements at the cost of a reasonable amount of recomputation.
arXiv Detail & Related papers (2022-10-15T20:59:23Z)
- NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z)
- Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation [68.25872110275542]
We propose an efficient inference procedure for non-autoregressive machine translation.
It iteratively refines translation purely in the continuous space.
We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En.
arXiv Detail & Related papers (2020-09-15T15:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.