Related papers: The Affine Divergence: Aligning Activation Updates Beyond Normalisation

URL: http://arxiv.org/abs/2512.22247v1
Date: Wed, 24 Dec 2025 00:31:22 GMT
Title: The Affine Divergence: Aligning Activation Updates Beyond Normalisation
Authors: George Bird
Abstract summary: A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. It is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. This constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh and conceptual reframe of normalisation's action, with auxiliary experiments bolstering this mechanistically. Moreover, this analysis makes clear a second possibility: a solution that is functionally distinct from modern normalisations, without scale-invariance, yet remains empirically successful, outperforming conventional normalisers across several tests. This is presented as an alternative to the affine map. This generalises to convolution via a new functional form, "PatchNorm", a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.
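The sample-wise scaling mentioned in the abstract can be illustrated with a small sketch for a single affine layer. This is not taken from the paper. It assumes plain single-sample gradient descent on z = Wx + b with the upstream gradient g = dL/dz held fixed, and the names `affine`, `activation_update`, and `scale_factor` are hypothetical. Under those assumptions the propagated activation change is -lr * (||x||^2 + 1) * g, whereas a direct steepest-descent step on the activation would be -lr * g.

```python
import random

def affine(W, b, x):
    """Compute z = W x + b for list-of-lists W and list vectors b, x."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def activation_update(W, b, x, g, lr):
    """Apply one gradient step to (W, b) for one sample; return the change in z.

    Uses dL/dW = g x^T and dL/db = g, which hold for z = W x + b.
    """
    z_before = affine(W, b, x)
    W_new = [[w - lr * g_i * x_j for w, x_j in zip(row, x)]
             for row, g_i in zip(W, g)]
    b_new = [b_i - lr * g_i for b_i, g_i in zip(b, g)]
    z_after = affine(W_new, b_new, x)
    return [after - before for after, before in zip(z_after, z_before)]

def scale_factor(x):
    """Sample-dependent factor (||x||^2 + 1) multiplying the ideal step."""
    return sum(v * v for v in x) + 1.0

random.seed(0)
n_in, n_out, lr = 4, 3, 0.1
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
b = [random.gauss(0, 1) for _ in range(n_out)]
g = [random.gauss(0, 1) for _ in range(n_out)]  # stand-in for dL/dz (assumed fixed)

x_small = [0.1] * n_in  # low-norm sample
x_large = [2.0] * n_in  # high-norm sample

for x in (x_small, x_large):
    dz = activation_update(W, b, x, g, lr)
    # Ratio of the effective activation step to the ideal step -lr * g.
    print("factor:", scale_factor(x), "first component of dz:", dz[0])
```

In this toy setting the two samples receive the same gradient direction but very different effective step sizes, which is the kind of sample-dependent mismatch the abstract describes. Dividing the step by the factor would remove it here; whether that matches the paper's proposed corrections is not something this sketch establishes.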

Related papers

Variational Deep Learning via Implicit Regularization [11.296548737163599]
Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. We propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent.
arXiv Detail & Related papers (2025-05-26T17:15:57Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z)
Vanishing Feature: Diagnosing Model Merging and Beyond [1.1510009152620668]
We identify the "vanishing feature" phenomenon, where input-induced features diminish during propagation through a merged model. We show that existing normalization strategies can be enhanced by precisely targeting the vanishing feature issue. We propose the "Preserve-First Merging" (PFM) strategy, which focuses on preserving early-layer features.
arXiv Detail & Related papers (2024-02-05T17:06:26Z)
Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult [49.8719617899285]
Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases, including the edge of stability. This paper provides an initial step and shows that these implicit biases are in fact various tips of the same iceberg.
arXiv Detail & Related papers (2023-10-26T01:11:17Z)
Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization [10.009748368458409]
We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity. Our method enables fully differentiable approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning.
arXiv Detail & Related papers (2023-07-07T13:06:12Z)
PAC-Chernoff Bounds: Understanding Generalization in the Interpolation Regime [6.645111950779666]
This paper introduces a distribution-dependent PAC-Chernoff bound that exhibits perfect tightness for interpolators. We present a unified theoretical framework revealing why certain interpolators show an exceptional generalization, while others falter.
arXiv Detail & Related papers (2023-06-19T14:07:10Z)
Matrix Completion via Non-Convex Relaxation and Adaptive Correlation Learning [90.8576971748142]
We develop a novel surrogate that can be optimized by closed-form solutions. We exploit upperwise correlation for completion, and thus an adaptive correlation learning model.
arXiv Detail & Related papers (2022-03-04T08:50:50Z)
Support recovery and sup-norm convergence rates for sparse pivotal estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level. We show minimax sup-norm convergence rates for non smoothed and smoothed, single task and multitask square-root Lasso-type estimators.
arXiv Detail & Related papers (2020-01-15T16:11:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
