Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle
- URL: http://arxiv.org/abs/2511.00907v1
- Date: Sun, 02 Nov 2025 11:58:50 GMT
- Title: Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle
- Authors: Ruifeng Ren, Sheng Ouyang, Huayi Tang, Yong Liu
- Abstract summary: This paper revisits the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework composed of three key components: the global energy $F^*$, the energy function $E_i$, and the employed gradient descent (GD) form. Inspired by classical GD algorithms, we extend the original attention formulation based on standard GD to momentum-based GD, Nesterov Accelerated Gradient (NAG), and Newton's method variants, each inducing a corresponding new attention structure.
- Score: 22.02194689588116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework composed of three key components: the global energy $F^*$, the energy function $E_i$, and the employed gradient descent (GD) form. Within this framework, standard softmax attention can be viewed as a special case of minimizing the Helmholtz free energy as $F^*$ using standard GD when $E_i$ takes the form of elastic potential energy, with residual connections ensuring that this optimization proceeds in an incremental manner. In addition, linear attentions can also be naturally incorporated into this framework by adjusting the corresponding energy forms. We also extend the above analysis to the multi-head setting, where the energy is defined across multiple low-dimensional subspaces. Building on this framework, we propose energy-based modifications of attention structures. Inspired by classical GD algorithms, we extend the original attention formulation based on standard GD to the momentum-based GD, Nesterov Accelerated Gradient (NAG), and Newton's method variants, each inducing a corresponding new attention structure. Our experiments provide preliminary support for the potential of the energy-based framework for designing attention mechanisms.
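The core reduction (one gradient-descent step on a free energy reproducing a residual softmax-attention update) can be checked numerically. Below is a minimal NumPy sketch, assuming the generic log-sum-exp (Helmholtz-style) free energy over query-key scores with values tied to keys; the paper's own construction uses an elastic potential energy $E_i$ and separate value projections, so this illustrates the mechanism rather than the authors' exact formulation. The momentum variant at the end is likewise only one plausible instantiation of the momentum-based attention the abstract mentions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def free_energy(q, K, beta=1.0):
    # Helmholtz-style free energy F(q) = -(1/beta) * log sum_i exp(beta * k_i . q).
    # Generic log-sum-exp stand-in; the paper's E_i is an elastic potential.
    s = beta * (K @ q)
    m = s.max()
    return -(m + np.log(np.exp(s - m).sum())) / beta

def gd_attention_step(q, K, beta=1.0, eta=0.5):
    # One GD step q <- q - eta * grad F(q). Since grad F(q) = -sum_i p_i k_i
    # with p = softmax(beta * K q), this is a residual attention update
    # (values tied to keys in this sketch).
    p = softmax(beta * (K @ q))
    return q + eta * (K.T @ p)

def momentum_attention_step(q, m, K, beta=1.0, eta=0.5, mu=0.9):
    # Heavy-ball variant: accumulate the attention readout in a velocity
    # buffer. One plausible form of a momentum-based attention step.
    p = softmax(beta * (K @ q))
    m = mu * m + (K.T @ p)
    return q + eta * m, m

# Check: the numerical gradient of F matches -sum_i p_i k_i.
rng = np.random.default_rng(0)
d, n = 4, 8
q, K = rng.normal(size=d), rng.normal(size=(n, d))
eps = 1e-6
num = np.array([(free_energy(q + eps * e, K) - free_energy(q - eps * e, K)) / (2 * eps)
                for e in np.eye(d)])
ana = -(K.T @ softmax(K @ q))
assert np.allclose(num, ana, atol=1e-5)
```

The residual connection corresponds to the `q + eta * (...)` form: each layer performs one incremental optimization step on the energy rather than recomputing the representation from scratch.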
Related papers
- YuriiFormer: A Suite of Nesterov-Accelerated Transformers [62.40952219538543]
We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step on an interaction energy, while layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energies (a toy sketch of this splitting follows this entry).
arXiv Detail & Related papers (2026-01-30T18:06:21Z)
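The splitting picture admits a compact toy illustration: each layer takes one gradient step on a token-coupling interaction energy (the attention-like part), then one token-wise potential step (the MLP-like part). A minimal sketch with hypothetical toy energies standing in for the paper's actual ones:

```python
import numpy as np

def lie_trotter_layer(X, grad_interaction, grad_potential, eta=0.1):
    # One "layer" as Lie-Trotter splitting of descent on E_int + E_pot:
    # first a coupled, attention-like step, then a position-wise step.
    # Hypothetical interfaces; a sketch of the idea, not YuriiFormer itself.
    X = X - eta * grad_interaction(X)
    X = X - eta * grad_potential(X)
    return X

# Toy energies (assumptions for illustration):
# E_int(X) = (1/(4n)) * sum_{i,j} ||x_i - x_j||^2  =>  grad_i = x_i - mean(X)
# E_pot(X) = 0.5 * ||X||^2                          =>  grad   = X
grad_int = lambda X: X - X.mean(axis=0, keepdims=True)
grad_pot = lambda X: X

X = np.random.default_rng(1).normal(size=(8, 4))  # 8 tokens, dim 4
for _ in range(5):                                # depth = number of GD iterations
    X = lie_trotter_layer(X, grad_int, grad_pot)
```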
- Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization [32.04194224236952]
We formalize token dynamics as a joint maximum likelihood estimation on the hypersphere. We present the Hyper-Spherical Energy Transformer (Hyper-SET), a recurrent-depth alternative to vanilla Transformers (a projected-gradient sketch follows this entry).
arXiv Detail & Related papers (2025-02-17T10:39:11Z)
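Energy minimization constrained to the hypersphere is commonly implemented as a Euclidean gradient step followed by renormalization. A generic sketch under that assumption (Hyper-SET's actual energy and update are defined in the paper); the recurrent-depth idea corresponds to iterating one shared step rather than stacking distinct layers:

```python
import numpy as np

def sphere_step(X, grad_E, eta=0.1):
    # Projected descent on the unit hypersphere: take a Euclidean gradient
    # step on the energy, then renormalize each token back onto the sphere.
    X = X - eta * grad_E(X)
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

# Toy energy (assumption): E(X) = -(1/(4n)) * tr((X X^T)^2), which rewards
# token alignment; its gradient is -(1/n) * (X X^T) X.
grad_E = lambda X: -(X @ X.T) @ X / len(X)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))
X = X / np.linalg.norm(X, axis=-1, keepdims=True)
for _ in range(20):            # recurrent depth: reuse the same step
    X = sphere_step(X, grad_E)
```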
- Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces novel deep dynamical models designed to represent continuous-time sequences. We train the model using maximum likelihood estimation with Markov chain Monte Carlo. Experimental results on oscillating systems, videos and real-world state sequences (MuJoCo) demonstrate that our model with the learnable energy-based prior outperforms existing counterparts.
arXiv Detail & Related papers (2024-09-05T18:14:22Z)
- Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention [0.7770029179741429]
Conditional diffusion models have shown remarkable success in visual content generation.
Recent attempts to extend unconditional guidance have relied on heuristic techniques, resulting in suboptimal generation quality.
We propose Smoothed Energy Guidance (SEG), a novel training- and condition-free approach to enhance image generation.
arXiv Detail & Related papers (2024-08-01T17:59:09Z)
- On Feature Diversity in Energy-based Models [98.78384185493624]
An energy-based model (EBM) is typically composed of one or more inner models that learn a combination of different features to generate an energy mapping for each input configuration.
We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on the performance of EBMs.
arXiv Detail & Related papers (2023-06-02T12:30:42Z)
- Energy Transformer [64.22957136952725]
Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory.
We propose a novel architecture, called the Energy Transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function.
arXiv Detail & Related papers (2023-02-14T18:51:22Z)
- SGEM: stochastic gradient with energy and momentum [0.0]
We propose SGEM, Stochastic Gradient with Energy and Momentum, to solve a class of general non-convex stochastic optimization problems.
SGEM incorporates both energy and momentum so as to derive energy-dependent convergence rates.
Our results show that SGEM converges faster than AEGD in neural network training (a schematic update rule follows this entry).
arXiv Detail & Related papers (2022-08-03T16:45:22Z)
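As a rough illustration of coupling an energy variable with momentum, the sketch below combines AEGD's transformed gradient and decaying energy variable r ≈ sqrt(f + c) with a momentum average. The momentum wiring here is an assumption made for illustration; SGEM's actual update and its convergence analysis are in the paper.

```python
import numpy as np

def sgem_like_step(x, r, m, f, grad_f, eta=0.1, beta_m=0.9, c=1.0):
    # Schematic energy-and-momentum update in the spirit of SGEM, built on
    # AEGD's energy variable. The momentum average is an assumed form.
    v = grad_f(x) / (2.0 * np.sqrt(f(x) + c))   # AEGD's transformed gradient
    m = beta_m * m + (1.0 - beta_m) * v          # momentum average (assumption)
    r = r / (1.0 + 2.0 * eta * m * m)            # energy decays, bounding the step
    x = x - 2.0 * eta * r * m
    return x, r, m

# Usage on a toy quadratic: f(x) = 0.5 * ||x||^2.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
x = np.array([2.0, -1.0])
r = np.full_like(x, np.sqrt(f(x) + 1.0))         # initialize energy at sqrt(f + c)
m = np.zeros_like(x)
for _ in range(100):
    x, r, m = sgem_like_step(x, r, m, f, grad_f)
```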
- Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction [51.80191416661064]
We propose a novel vision transformer with latent variables following an informative energy-based prior for salient object detection.
Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation (a Langevin-sampling sketch follows this entry).
With the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image.
arXiv Detail & Related papers (2021-12-27T06:04:33Z)
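Maximum likelihood training of an energy-based prior typically draws latent samples with short-run Langevin dynamics. A minimal, generic sketch of that sampler (`grad_log_p` is a hypothetical stand-in for the learned prior's score function; step sizes and counts are illustrative):

```python
import numpy as np

def langevin_sample(z, grad_log_p, n_steps=50, step=0.01, rng=None):
    # Unadjusted Langevin dynamics, the standard short-run MCMC for sampling
    # latents from an energy-based prior during maximum-likelihood training:
    #   z <- z + (step/2) * grad log p(z) + sqrt(step) * noise
    rng = rng or np.random.default_rng()
    for _ in range(n_steps):
        z = z + 0.5 * step * grad_log_p(z) + np.sqrt(step) * rng.normal(size=z.shape)
    return z

# Toy check: with a standard-normal prior, grad log p(z) = -z.
z = langevin_sample(np.zeros(16), lambda z: -z)
```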
- Energy-Based Processes for Exchangeable Data [109.04978766553612]
We introduce Energy-Based Processes (EBPs) to extend energy-based models to exchangeable data.
A key advantage of EBPs is the ability to express more flexible distributions over sets without restricting their cardinality.
We develop an efficient training procedure for EBPs that demonstrates state-of-the-art performance on a variety of tasks.
arXiv Detail & Related papers (2020-03-17T04:26:02Z)