Regime Change Hypothesis: Foundations for Decoupled Dynamics in Neural Network Training
- URL: http://arxiv.org/abs/2602.08333v1
- Date: Mon, 09 Feb 2026 07:14:28 GMT
- Title: Regime Change Hypothesis: Foundations for Decoupled Dynamics in Neural Network Training
- Authors: Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
- Abstract summary: In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. We investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model.
- Score: 1.0518862318418603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the empirical success of DNNs, their internal training dynamics remain difficult to characterize. In ReLU-based models, the activation pattern induced by a given input determines the piecewise-linear region in which the network behaves affinely. Motivated by this geometry, we investigate whether training exhibits a two-timescale behavior: an early stage with substantial changes in activation patterns and a later stage where weight updates predominantly refine the model within largely stable activation regimes. We first prove a local stability property: outside measure-zero sets of parameters and inputs, sufficiently small parameter perturbations preserve the activation pattern of a fixed input, implying locally affine behavior within activation regions. We then empirically track per-iteration changes in weights and activation patterns across fully connected and convolutional architectures, as well as Transformer-based models, where activation patterns are recorded in the ReLU feed-forward (MLP/FFN) submodules, using fixed validation subsets. Across the evaluated settings, activation-pattern changes decay three times earlier than weight-update magnitudes, showing that late-stage training often proceeds within relatively stable activation regimes. These findings provide a concrete, architecture-agnostic instrument for monitoring training dynamics and motivate further study of decoupled optimization strategies for piecewise-linear networks. For reproducibility, code and experiment configurations will be released upon acceptance.
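The monitoring instrument described in the abstract can be sketched in a few lines. This is a minimal illustration (not the authors' released code) using a toy single-layer ReLU net: on a fixed validation batch, record the binary activation pattern after each weight update and compare it with the previous iteration's pattern, alongside the weight-update norm. All names and the toy dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_pattern(W, X):
    """Binary activation pattern of a one-layer ReLU net on inputs X."""
    return (X @ W.T > 0)

def pattern_change(p_prev, p_curr):
    """Fraction of (input, unit) pairs whose activation state flipped."""
    return float(np.mean(p_prev != p_curr))

# Fixed validation subset and a toy weight matrix.
X_val = rng.normal(size=(128, 16))   # 128 validation inputs, 16 features
W = rng.normal(size=(32, 16))        # 32 hidden ReLU units

p_prev = relu_pattern(W, X_val)
for step in range(3):
    dW = 0.01 * rng.normal(size=W.shape)   # stand-in for an SGD update
    W += dW
    p_curr = relu_pattern(W, X_val)
    flip_rate = pattern_change(p_prev, p_curr)   # activation-pattern change
    update_norm = float(np.linalg.norm(dW))      # weight-update magnitude
    p_prev = p_curr
```

In the paper's experiments, the analogue of `flip_rate` decays well before the analogue of `update_norm`, which is the two-timescale signature being tracked.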
Related papers
- Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site.
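For intuition, the first-order equivalence summarized above can be checked numerically in the simplest case. This hedged sketch (not from the paper) uses a linear layer, where the correspondence is exact: a weight update dW changes the output by dW @ x, so adding the activation-space vector v = dW @ x at the layer output reproduces the updated output for that input.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))            # toy layer weights (illustrative)
dW = 1e-3 * rng.normal(size=(8, 4))    # small weight-space update
x = rng.normal(size=4)                 # a single input

y_finetuned = (W + dW) @ x   # output after the weight-space update
v = dW @ x                   # equivalent activation-space steering vector
y_steered = W @ x + v        # output after steering the activation

# Exact for a linear layer; first-order for nonlinear blocks.
assert np.allclose(y_finetuned, y_steered)
```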
arXiv Detail & Related papers (2026-02-28T02:50:04Z) - Activation-Space Uncertainty Quantification for Pretrained Networks [2.001149416674759]
We introduce Gaussian Process Activations (GAPA), a post-hoc method that shifts Bayesian modeling from weights to activations. GAPA replaces standard nonlinearities with activations whose posterior mean exactly matches the original activation, preserving the backbone's point predictions by construction. To scale to modern architectures, we use a sparse variational inducing-point approximation over cached training activations, combined with local k-nearest-neighbor conditioning.
arXiv Detail & Related papers (2026-02-16T17:17:08Z) - Activation Function Design Sustains Plasticity in Continual Learning [1.618563064839635]
In continual learning, models can progressively lose the ability to adapt. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss.
arXiv Detail & Related papers (2025-09-26T16:41:47Z) - Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks [3.924071936547547]
Gated recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior. This effect arises from the coupling between state-space time scales, parametrized by the gates, and parameter-space dynamics. Empirical simulations corroborate these claims.
arXiv Detail & Related papers (2025-08-16T18:19:34Z) - Weight-Space Linear Recurrent Neural Networks [2.77067514910801]
WARP (Weight-space Adaptive Recurrent Prediction) is a powerful model that unifies weight-space learning with linear recurrence. We show that WARP matches or surpasses state-of-the-art baselines on diverse classification tasks. Remarkably, a physics-informed variant of our model outperforms the next best model by more than 10x.
arXiv Detail & Related papers (2025-06-01T20:13:28Z) - Weight Spectra Induced Efficient Model Adaptation [54.8615621415845]
Fine-tuning large-scale foundation models incurs prohibitive computational costs. We show that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact. We propose a novel method that leverages learnable rescaling of top singular directions.
arXiv Detail & Related papers (2025-05-29T05:03:29Z) - PreAdaptFWI: Pretrained-Based Adaptive Residual Learning for Full-Waveform Inversion Without Dataset Dependency [8.719356558714246]
Full-waveform inversion (FWI) is a method that utilizes seismic data to invert the physical parameters of subsurface media. Due to its ill-posed nature, FWI is susceptible to getting trapped in local minima. Various research efforts have attempted to combine neural networks with FWI to stabilize the inversion process.
arXiv Detail & Related papers (2025-02-17T15:30:17Z) - Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on a quantized 8-bit ViT, outperforms gradient-based TENT on a full-precision 32-bit ViT, and achieves up to a 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z) - Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation [67.18144414660681]
We propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online Vision-and-Language Navigation (VLN).
Our method obtains impressive performance gains on four popular benchmarks.
arXiv Detail & Related papers (2023-11-22T07:47:39Z) - ENN: A Neural Network with DCT Adaptive Activation Functions [2.2713084727838115]
We present Expressive Neural Network (ENN), a novel model in which the non-linear activation functions are modeled using the Discrete Cosine Transform (DCT).
This parametrization keeps the number of trainable parameters low, is appropriate for gradient-based schemes, and adapts to different learning tasks.
ENN outperforms state-of-the-art benchmarks, providing an accuracy gap of more than 40% in some scenarios.
arXiv Detail & Related papers (2023-07-02T21:46:30Z) - Training Generative Adversarial Networks by Solving Ordinary Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training.
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error.
We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
arXiv Detail & Related papers (2020-10-28T15:23:49Z) - Active Tuning [0.5801044612920815]
We introduce Active Tuning, a novel paradigm for optimizing the internal dynamics of recurrent neural networks (RNNs) on the fly.
In contrast to the conventional sequence-to-sequence mapping scheme, Active Tuning decouples the RNN's recurrent neural activities from the input stream.
We demonstrate the effectiveness of Active Tuning on several time series prediction benchmarks.
arXiv Detail & Related papers (2020-10-02T20:21:58Z) - An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows stabilizes training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this kernel-to-rich transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.