Deep learning for pedestrians: backpropagation in Transformers
- URL: http://arxiv.org/abs/2512.23329v1
- Date: Mon, 29 Dec 2025 09:26:19 GMT
- Title: Deep learning for pedestrians: backpropagation in Transformers
- Authors: Laurent Boué
- Abstract summary: We apply our index-free methodology to new types of layers such as embedding, multi-headed self-attention and layer normalization. A complete PyTorch implementation of a minimalistic GPT-like network is also provided, along with analytical expressions for all of its gradient updates.
- Score: 1.14219428942199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This document is a follow-up to our previous paper dedicated to a vectorized derivation of backpropagation in CNNs. Following the same principles and notations already put in place there, we now focus on transformer-based next-token-prediction architectures. To this end, we apply our lightweight index-free methodology to new types of layers such as embedding, multi-headed self-attention and layer normalization. In addition, we also provide gradient expressions for LoRA layers to illustrate parameter-efficient fine-tuning. Why bother doing manual backpropagation when there are so many tools that do this automatically? Any gap in understanding of how values propagate forward will become evident when attempting to differentiate the loss function. By working through the backward pass manually, we gain a deeper intuition for how each operation influences the final output. A complete PyTorch implementation of a minimalistic GPT-like network is also provided, along with analytical expressions for all of its gradient updates.
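To give a flavor of the kind of derivation the paper advocates, here is a minimal sketch (not the paper's own code or notation) of a vectorized, index-free manual backward pass for layer normalization, cross-checked against PyTorch autograd; the tensor shapes and variable names are illustrative assumptions only.

```python
import torch

torch.manual_seed(0)
B, T, D, eps = 2, 5, 8, 1e-5                 # toy batch, sequence length, width

x = torch.randn(B, T, D, requires_grad=True)
gamma = torch.randn(D, requires_grad=True)   # scale parameter
beta = torch.randn(D, requires_grad=True)    # shift parameter

# Forward pass of layer normalization over the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
std = torch.sqrt(var + eps)
xhat = (x - mu) / std
y = gamma * xhat + beta

# Upstream gradient dL/dy (a random stand-in for whatever the loss supplies).
dy = torch.randn_like(y)

# Manual, index-free backward pass.
dbeta = dy.sum(dim=(0, 1))
dgamma = (dy * xhat).sum(dim=(0, 1))
dxhat = dy * gamma
dx = (dxhat
      - dxhat.mean(dim=-1, keepdim=True)
      - xhat * (dxhat * xhat).mean(dim=-1, keepdim=True)) / std

# Cross-check the closed-form gradients against autograd.
y.backward(dy)
assert torch.allclose(dx, x.grad, atol=1e-5)
assert torch.allclose(dgamma, gamma.grad, atol=1e-5)
assert torch.allclose(dbeta, beta.grad, atol=1e-5)
print("manual layer-norm gradients match autograd")
```

The same pattern (forward pass built from primitives, closed-form gradient expression, allclose check against autograd) carries over to the embedding, multi-headed self-attention, and LoRA layers discussed in the paper.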
Related papers
- The Normalized Difference Layer: A Differentiable Spectral Index Formulation for Deep Learning [0.5131152350448098]
We introduce the Normalized Difference Layer, a differentiable neural network module. We present a complete mathematical framework for integrating this layer into deep learning architectures. Experiments show that models using this layer reach classification accuracy similar to standard multilayer perceptrons.
arXiv Detail & Related papers (2026-01-11T05:03:01Z) - A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity [0.7617849765320394]
We present a custom automatic differentiation (AD) pipeline for sparse and online implementations of gradient-based synaptic plasticity rules. Our work combines the programming ease of backpropagation-type methods with forward AD while remaining memory-efficient. We demonstrate how memory utilization scales with network size without depending on the sequence length.
arXiv Detail & Related papers (2025-01-20T11:14:11Z) - Learning effective pruning at initialization from iterative pruning [15.842658282636876]
We present an end-to-end neural network-based PaI method to reduce training costs.
Our approach outperforms existing methods in high-sparsity settings.
Ours is the first neural network-based PaI method, and we conduct extensive experiments to validate the factors influencing this approach.
arXiv Detail & Related papers (2024-08-27T03:17:52Z) - Extraction Propagation [4.368185344922342]
We present an alternative architecture composed of many small neural networks that interact with one another. Instead of propagating gradients back through the architecture, we propagate vector-valued messages computed via forward passes.
arXiv Detail & Related papers (2024-02-24T19:06:41Z) - How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
arXiv Detail & Related papers (2023-12-07T21:40:44Z) - Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how such sign-based update algorithm influences step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign (a generic sketch of this sign-based versus raw-gradient contrast appears after this list).
The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments.
arXiv Detail & Related papers (2023-12-03T02:26:58Z) - Jump to Conclusions: Short-Cutting Transformers With Linear Transformations [60.37563766047492]
Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction.
This obscures the internal decision-making process of the model and the utility of its intermediate representations.
We suggest a simple method for such casting, using linear transformations (a toy illustration of the general idea appears after this list).
arXiv Detail & Related papers (2023-03-16T16:10:16Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse [11.486545294602697]
We shed new light on the causes and effects of rank collapse in Transformers.
We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish.
arXiv Detail & Related papers (2022-06-07T09:07:24Z) - Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z) - DHP: Differentiable Meta Pruning via HyperNetworks [158.69345612783198]
This paper introduces a differentiable pruning method via hypernetworks for automatic network pruning.
Latent vectors control the output channels of the convolutional layers in the backbone network and act as a handle for the pruning of the layers.
Experiments are conducted on various networks for image classification, single image super-resolution, and denoising.
arXiv Detail & Related papers (2020-03-30T17:59:18Z)
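The "Rethinking PGD Attack" entry above contrasts sign-based updates with a sign-free raw-gradient update. The snippet below is a generic, hedged illustration of that contrast on a toy differentiable objective; it is not the paper's RGD formulation, and the loss, step sizes, and projection radius are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy "attack" objective: increase the loss of a fixed linear model with respect
# to a perturbation delta constrained to an L-infinity ball of given radius.
w = torch.randn(10)
x0 = torch.randn(10)
target = torch.tensor(1.0)
loss_fn = lambda x: (x @ w - target) ** 2

def attack(use_sign, steps=40, alpha=0.01, radius=0.1):
    delta = torch.zeros_like(x0, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(x0 + delta)
        grad, = torch.autograd.grad(loss, delta)
        # Sign-based update (PGD-style) versus raw gradient; the raw variant
        # typically needs a different step-size scale in practice.
        step = grad.sign() if use_sign else grad
        with torch.no_grad():
            delta += alpha * step              # gradient ascent on the loss
            delta.clamp_(-radius, radius)      # projection onto the L-inf ball
    return loss_fn(x0 + delta.detach()).item()

print("sign-based update loss :", attack(use_sign=True))
print("raw-gradient update loss:", attack(use_sign=False))
```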
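Similarly, the "Jump to Conclusions" entry describes casting intermediate hidden representations into the final-layer space with a linear transformation. Below is a hedged toy sketch of that general idea using least squares on synthetic stand-in hidden states; the shapes, data, and fitting procedure are illustrative assumptions, not the paper's setup.

```python
import torch

torch.manual_seed(0)
N, D = 1000, 64                        # number of tokens, hidden width (toy sizes)

# Stand-ins for hidden states collected at an intermediate layer (H_mid) and at
# the final layer (H_final); here they are synthetically related for the demo.
H_mid = torch.randn(N, D)
true_map = torch.randn(D, D) / D ** 0.5
H_final = H_mid @ true_map + 0.01 * torch.randn(N, D)

# Fit a linear transformation A such that H_mid @ A approximates H_final.
A = torch.linalg.lstsq(H_mid, H_final).solution

# "Short-cut": cast a fresh intermediate representation into final-layer space.
h_new = torch.randn(1, D)
h_cast = h_new @ A
print("relative fit error:", ((H_mid @ A - H_final).norm() / H_final.norm()).item())
```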