Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States
- URL: http://arxiv.org/abs/2511.14808v1
- Date: Mon, 17 Nov 2025 19:39:15 GMT
- Title: Transformer Injectivity & Geometric Robustness - Analytic Margins and Bi-Lipschitz Uniformity of Sequence-Level Hidden States
- Authors: Mikael von Strauss
- Abstract summary: We show that the map from discrete prompts to last-token hidden states is generically injective on finite prompt sets. We study behavior across layers, sequence lengths, model scales, and 8- and 4-bit activation quantization.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Under real-analytic assumptions on decoder-only Transformers, recent work shows that the map from discrete prompts to last-token hidden states is generically injective on finite prompt sets. We refine this picture: for each layer $\ell$ we define a collision discriminant $\Delta^\ell \subset \Theta$ and an injective stratum $U^\ell = \Theta \setminus \Delta^\ell$, and prove a dichotomy: either the model is nowhere injective on the set, or $U^\ell$ is open and dense and every $F^\ell_\theta$ is injective. Under mild non-singularity assumptions on the optimizer and an absolutely continuous initialization, generic injectivity persists along smooth training trajectories over any fixed horizon. We also treat symmetry groups $G$, showing that discriminants and injective strata descend to the quotient $\Theta/G$, so injectivity is naturally a property of functional equivalence classes. We complement these results with an empirical study of layerwise geometric diagnostics. We define a separation margin and a co-Lipschitz (lower Lipschitz) constant between prompt space and last-token representation space, estimated via nearest-neighbor statistics on large prompt sets. Applying these diagnostics to pretrained LLaMA-3 and Qwen models, we study behavior across layers, sequence lengths, model scales, and 8- and 4-bit activation quantization. On our sampled prompts we see no collisions in full precision or at 8 bits, while 4-bit quantization induces a small number of collisions and markedly shrinks co-Lipschitz estimates. For a small GPT-2 trained from scratch, normalized metrics remain stable over training. Overall, the results suggest that Transformer representations are generically and persistently injective in the continuous-parameter idealization, while their practical invertibility can be probed using simple geometric diagnostics.
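A minimal sketch of the collision, separation-margin, and co-Lipschitz diagnostics described in the abstract, assuming last-token hidden states are available as a NumPy array; the function names, the synthetic hidden states, and the synthetic prompt-space metric below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def collision_and_margin(H, decimals=None):
    """Return (num_collisions, separation_margin) for last-token states H.

    H: (n, d) array of last-token hidden states, one row per prompt.
    decimals: optionally round to crudely emulate reduced precision.
    """
    if decimals is not None:
        H = np.round(H, decimals)
    # All pairwise distances between distinct prompts' representations.
    dist = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    iu = np.triu_indices(len(H), k=1)
    pair = dist[iu]
    collisions = int(np.sum(pair == 0.0))           # exact representation clashes
    margin = float(pair[pair > 0].min()) if np.any(pair > 0) else 0.0
    return collisions, margin

def co_lipschitz_lower(H, prompt_dist):
    """Crude lower-Lipschitz estimate: min over pairs of
    ||F(x) - F(y)|| / d(x, y), where d is an (n, n) prompt-space metric."""
    dist = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    iu = np.triu_indices(len(H), k=1)
    return float(np.min(dist[iu] / prompt_dist[iu]))

# Synthetic stand-ins for real model activations and a prompt metric.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 16))
c, m = collision_and_margin(H)
D = np.abs(rng.normal(size=(50, 50))) + 1.0  # strictly positive "prompt distances"
D = (D + D.T) / 2
alpha = co_lipschitz_lower(H, D)
```

On generic continuous data this reports zero collisions and a strictly positive margin, matching the full-precision behavior the paper observes; rounding `H` aggressively plays the role of activation quantization.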
Related papers
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) retrains a student on a mixture of ground-truth labels and the teacher's own predictions. We show that for any prediction risk, the optimally mixed student improves upon the ridge teacher at every regularization level. We propose a consistent one-shot tuning method that estimates the optimal mixing weight without grid search, sample splitting, or refitting.
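The self-distillation recipe summarized above can be sketched in a few lines of ridge regression; the data, `lam`, and the mixing weight `alpha` below are made-up illustrative choices, not the paper's optimal one-shot-tuned value:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression estimator."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=100)

lam = 1.0
w_teacher = ridge(X, y, lam)       # teacher: plain ridge fit
y_teacher = X @ w_teacher          # teacher's own predictions

alpha = 0.7                        # illustrative mixing weight
y_mixed = alpha * y + (1 - alpha) * y_teacher
w_student = ridge(X, y_mixed, lam) # student: ridge fit on the mixed targets
```

Because the ridge solution is linear in the targets, the student is itself a linear combination of ridge fits, which is what makes the mixing weight analytically tunable.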
arXiv Detail & Related papers (2026-02-19T17:21:15Z) - Singular Bayesian Neural Networks [1.2891210250935148]
Bayesian neural networks promise calibrated uncertainty but require $O(mn)$ parameters for standard mean-field Gaussian posteriors. We induce a posterior that is singular with respect to the Lebesgue measure, concentrating on a rank-$r$ manifold. We derive PAC-Bayes generalization bounds whose complexity term scales as $\sqrt{r(m+n)}$ instead of $\sqrt{mn}$, and prove loss bounds that decompose the error into optimization and rank-induced bias.
arXiv Detail & Related papers (2026-01-30T23:06:34Z) - Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, in which the probability of rarely played actions grows exponentially from exponentially small values.
arXiv Detail & Related papers (2025-12-31T12:08:29Z) - Phase-space entropy at acquisition reflects downstream learnability [54.4100065023873]
We propose an acquisition-level scalar $S_{\mathcal{B}}$ based on instrument-resolved phase space. We show theoretically that $S_{\mathcal{B}}$ correctly identifies the phase-space coherence of periodic sampling. $|S_{\mathcal{B}}|$ consistently ranks sampling geometries and predicts downstream reconstruction/recognition difficulty without training.
arXiv Detail & Related papers (2025-12-22T10:03:51Z) - Symmetry-Breaking Descent for Invariant Cost Functionals [0.0]
We study the problem of reducing a task cost functional $W : H^s(M) \to \mathbb{R}$ that is not assumed continuous or differentiable. We show that symmetry-breaking deformations of the signal can reduce the cost.
arXiv Detail & Related papers (2025-05-19T15:06:31Z) - Emergence and scaling laws in SGD learning of shallow neural networks [64.48316762675141]
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data. We provide a precise analysis of the SGD dynamics when training a student two-layer network to minimize the mean squared error (MSE) objective.
arXiv Detail & Related papers (2025-04-28T16:58:55Z) - Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime [38.134523847923646]
The tendency of gradient descent to operate at the boundary of stable step sizes is known as the Edge of Stability (EoS) phenomenon. We show that EoS occurs even when the loss $l$ is quadratic, under proper conditions. We also shed new light on the implicit bias of diagonal linear networks when a larger step size is employed.
arXiv Detail & Related papers (2024-12-11T02:07:37Z) - Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving increasing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analyses, but these are limited by the standard $L$-smooth or bounded-gradient assumptions. We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space for modeling functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima [14.141453107129403]
Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at the perturbed point $y_t = x_t + \rho \frac{\nabla f(x_t)}{\|\nabla f(x_t)\|}$.
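The SAM update quoted above can be sketched on a toy quadratic; `rho` and `lr` are illustrative values, and the objective is a stand-in, not the paper's setting:

```python
import numpy as np

def grad(x):
    return 2 * x  # gradient of the toy objective f(x) = ||x||^2

def sam_step(x, rho, lr):
    g = grad(x)
    y = x + rho * g / np.linalg.norm(g)  # ascend to the perturbed point y_t
    return x - lr * grad(y)              # descend using the gradient at y_t

x = np.array([1.0, -2.0])
x_next = sam_step(x, rho=0.05, lr=0.1)
```

Each step thus uses two gradient evaluations: one to build the worst-case perturbation, one to actually move the iterate.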
arXiv Detail & Related papers (2023-06-16T13:47:04Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
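For reference, a plain-NumPy sketch of one standard Adam update, the baseline algorithm this analysis covers (the variance-reduced variant is not shown, and the hyperparameters are the usual defaults, not the paper's):

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters x given gradient g at step t >= 1."""
    m = b1 * m + (1 - b1) * g        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g    # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

x = np.array([1.0, -1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
x, m, v = adam_step(x, g=2 * x, m=m, v=v, t=1)
```

On the first step the bias-corrected ratio is approximately $g / |g|$ elementwise, so each coordinate moves by roughly the learning rate in the descent direction.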
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry [9.695960412426672]
We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d \ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, and (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
arXiv Detail & Related papers (2020-08-04T20:08:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.