Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability
- URL: http://arxiv.org/abs/2601.16563v1
- Date: Fri, 23 Jan 2026 09:03:25 GMT
- Title: Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability
- Authors: Vasileios Sevetlidis, George Pavlidis
- Abstract summary: We build a simple, model-agnostic witness of training memory based on back-flow of distinguishability. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplified under higher momentum and more micro-steps. We position this as a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization.
- Score: 1.078600700827543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work proposes neural training as a \emph{process tensor}: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on \emph{back-flow of distinguishability}. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $\Delta_{\mathrm{BF}} = D_2 - D_1 > 0$ (with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a \emph{causal break} (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger distances, inexpensive to compute, and requires no architectural changes. We position this as a \emph{measurement} contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. "Data order matters" turns into a testable operator with confidence bounds; our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.
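To make the two-step witness concrete, below is a minimal sketch (not the authors' implementation) of how $\Delta_{\mathrm{BF}}$ could be estimated with the total-variation distance on softmax predictions over a fixed probe set. The factory `make_model_and_opt`, the micro-step routine `step`, and the batch arguments are hypothetical stand-ins for whatever training loop is under study.

```python
import copy
import torch
import torch.nn.functional as F

def probe_softmax(model, probe_x):
    """Softmax predictions over a fixed probe set; restores the model's train/eval mode."""
    was_training = model.training
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(probe_x), dim=-1)
    model.train(was_training)
    return probs

def total_variation(p, q):
    """Total-variation distance between per-example softmax outputs, averaged over the probe set."""
    return 0.5 * (p - q).abs().sum(dim=-1).mean().item()

def backflow_witness(make_model_and_opt, step, batch_a, batch_b, batch_common, probe_x):
    """Two-step protocol: the branches differ only in the first intervention
    (batch_a vs. batch_b) and receive the identical second intervention (batch_common).
    Returns (D1, D2, delta_bf); delta_bf > 0 witnesses non-Markovian training memory."""
    # Both branches start from identical parameters and optimizer state.
    model_a, opt_a = make_model_and_opt()
    model_b, opt_b = make_model_and_opt()
    model_b.load_state_dict(copy.deepcopy(model_a.state_dict()))
    opt_b.load_state_dict(copy.deepcopy(opt_a.state_dict()))

    # Step 1: distinguishable interventions on the two branches.
    step(model_a, opt_a, batch_a)
    step(model_b, opt_b, batch_b)
    d1 = total_variation(probe_softmax(model_a, probe_x), probe_softmax(model_b, probe_x))

    # Step 2: identical intervention on both branches; under a Markovian
    # idealization the branches should not become more distinguishable.
    step(model_a, opt_a, batch_common)
    step(model_b, opt_b, batch_common)
    d2 = total_variation(probe_softmax(model_a, probe_x), probe_softmax(model_b, probe_x))

    return d1, d2, d2 - d1
```

In this reading, the abstract's causal break would correspond to resetting the optimizer state (e.g., momentum buffers) in both branches before the second step; a collapse of $\Delta_{\mathrm{BF}}$ under that reset is what attributes the memory to optimizer/data state. JS and Hellinger variants follow by swapping the distance function, and bootstrap confidence intervals could be obtained by repeating the protocol over resampled batches or probe examples.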
Related papers
- Information Hidden in Gradients of Regression with Target Noise [2.8911861322232686]
We show that the gradients alone can reveal the Hessian. We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs.
arXiv Detail & Related papers (2026-01-26T14:50:16Z)
- Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds [0.4779196219827507]
We show how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores. We show that this coupled specialization behaves like a two-timescale EM procedure.
arXiv Detail & Related papers (2025-12-27T05:31:44Z)
- Phase-space entropy at acquisition reflects downstream learnability [54.4100065023873]
We propose an acquisition-level scalar $S_{\mathcal{B}}$ based on instrument-resolved phase space. We show theoretically that $S_{\mathcal{B}}$ correctly identifies the phase-space coherence of periodic sampling. $|S_{\mathcal{B}}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training.
arXiv Detail & Related papers (2025-12-22T10:03:51Z)
- Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z)
- Unlearning at Scale: Implementing the Right to be Forgotten in Large Language Models [0.0]
Our approach treats unlearning as a minimal program and logs per-microbatch records. Under a pinned stack and deterministic kernels, replaying the training tail yields the same parameters as training on the retain set.
arXiv Detail & Related papers (2025-08-17T03:29:22Z)
- Uncertainty-Aware Graph Self-Training with Expectation-Maximization Regularization [2.743479615751918]
We propose a novel uncertainty-aware graph self-training approach for semi-supervised node classification. Our method incorporates an uncertainty mechanism during pseudo-label generation and model retraining. Our framework is designed to handle noisy graph structures and feature spaces more effectively.
arXiv Detail & Related papers (2025-03-26T21:52:21Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Mutual-Information Based Few-Shot Classification [34.95314059362982]
We introduce Transductive Information Maximization (TIM) for few-shot learning.
Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task.
We propose a new alternating-direction solver, which speeds up transductive inference over gradient-based optimization.
arXiv Detail & Related papers (2021-06-23T09:17:23Z)
- Coping with Label Shift via Distributionally Robust Optimisation [72.80971421083937]
We propose a model that minimises an objective based on distributionally robust optimisation (DRO).
We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective.
arXiv Detail & Related papers (2020-10-23T08:33:04Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay, a popular reinforcement learning technique, that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)