YuriiFormer: A Suite of Nesterov-Accelerated Transformers
- URL: http://arxiv.org/abs/2601.23236v1
- Date: Fri, 30 Jan 2026 18:06:21 GMT
- Title: YuriiFormer: A Suite of Nesterov-Accelerated Transformers
- Authors: Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet
- Abstract summary: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energies.
- Score: 62.40952219538543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
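The abstract names the two oracles (attention for the interaction energy, the MLP for the potential energy) but not the exact accelerated update. A minimal PyTorch sketch of the idea, assuming the standard constant-momentum Nesterov lookahead and a pre-LayerNorm GPT-style block (the class name, hyperparameters, and placement of the lookahead are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class NesterovBlockStack(nn.Module):
    """Hypothetical sketch: transformer layers as Nesterov-accelerated
    descent steps on token embeddings. Attention and the MLP act as the
    two (negative) gradient oracles; Lie-Trotter splitting alternates
    their sub-steps within each iteration."""

    def __init__(self, d_model=256, n_heads=4, depth=6, beta=0.5):
        super().__init__()
        self.beta = beta  # momentum coefficient (assumed constant here)
        self.ln1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
        self.ln2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(depth))
        self.mlp = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(depth))

    def forward(self, x):                     # x: (batch, tokens, d_model)
        x_prev = x                            # x_{k-1}: Nesterov tracks the previous iterate
        for ln1, ln2, attn, mlp in zip(self.ln1, self.ln2, self.attn, self.mlp):
            y = x + self.beta * (x - x_prev)  # lookahead: y_k = x_k + beta (x_k - x_{k-1})
            x_prev = x
            h = ln1(y)                        # attention sub-step (interaction energy)
            y = y + attn(h, h, h, need_weights=False)[0]
            x = y + mlp(ln2(y))               # MLP sub-step (potential energy)
        return x
```

Setting `beta = 0` collapses the loop to the usual residual updates `x = x + attn(...)`, `x = x + mlp(...)`, consistent with the abstract's claim that standard GPT-style transformers correspond to vanilla gradient descent under Lie-Trotter splitting.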
Related papers
- A Constrained Optimization Perspective of Unrolled Transformers [77.12297732942095]
We introduce a constrained optimization framework for training transformers that behave like iterative descent algorithms. We observe that constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization.
arXiv Detail & Related papers (2026-01-24T02:12:39Z)
- Rethinking Vision Transformer Depth via Structural Reparameterization [16.12815682992294]
We propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K.
arXiv Detail & Related papers (2025-11-24T21:28:55Z)
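The entry above consolidates parallel training-time branches into a single path. Below is a minimal sketch of the linear-branch fusion that makes such a merge exact; the paper's scheme for reducing depth inside transformer blocks is more involved, and `ParallelLinearBranches` / `merge` are hypothetical names:

```python
import torch
import torch.nn as nn

class ParallelLinearBranches(nn.Module):
    """Training-time module with two parallel linear branches (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)
        self.branch_b = nn.Linear(dim, dim)

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)

    def merge(self):
        """Fold both branches into one layer: (A x + a) + (B x + b) = (A + B) x + (a + b)."""
        fused = nn.Linear(self.branch_a.in_features, self.branch_a.out_features)
        with torch.no_grad():
            fused.weight.copy_(self.branch_a.weight + self.branch_b.weight)
            fused.bias.copy_(self.branch_a.bias + self.branch_b.bias)
        return fused

# Sanity check: the fused single-path layer reproduces the two-branch output.
m = ParallelLinearBranches(8)
x = torch.randn(2, 8)
assert torch.allclose(m(x), m.merge()(x), atol=1e-6)
```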
- Wavy Transformer [5.4806374384787695]
We propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule.
arXiv Detail & Related papers (2025-08-18T10:03:38Z)
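The summary does not give the discretization, so the following is only a guess at what a second-order, state-velocity attention update could look like, treating the attention output as a force term (the layer name, the explicit-Euler-style integrator, and the step size `dt` are all assumptions):

```python
import torch
import torch.nn as nn

class SecondOrderAttentionLayer(nn.Module):
    """Hypothetical second-order (wave-like) attention update: each token
    carries a state x and a velocity v, and attention supplies the force."""

    def __init__(self, d_model=128, n_heads=4, dt=0.1):
        super().__init__()
        self.dt = dt
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, v):
        h = self.norm(x)
        force = self.attn(h, h, h, need_weights=False)[0]
        v = v + self.dt * force   # second order: update the velocity first,
        x = x + self.dt * v       # then move the state along the new velocity
        return x, v

layer = SecondOrderAttentionLayer()
x = torch.randn(2, 16, 128)
v = torch.zeros_like(x)           # start from rest
for _ in range(4):                # repeated application integrates the dynamics
    x, v = layer(x, v)
```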
- Plain Transformers Can be Powerful Graph Learners [64.50059165186701]
Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers have strayed far from plain Transformers. This work demonstrates that the plain Transformer architecture can be a powerful graph learner.
arXiv Detail & Related papers (2025-04-17T02:06:50Z)
- Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization [32.04194224236952]
We formalize token dynamics as a joint maximum likelihood estimation on the hypersphere. We present the Hyper-Spherical Energy Transformer (Hyper-SET), a recurrent-depth alternative to vanilla Transformers.
arXiv Detail & Related papers (2025-02-17T10:39:11Z)
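Read as energy minimization on the sphere, the recurrent-depth idea could be sketched as one weight-shared block applied repeatedly, with tokens projected back onto the unit hypersphere after every step. This is an assumption-heavy reading of the two-sentence summary, not the actual Hyper-SET layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentHypersphereBlock(nn.Module):
    """Illustrative recurrent-depth block: shared weights across steps,
    with a retraction onto the unit hypersphere after each update."""

    def __init__(self, d_model=128, n_heads=4, n_steps=8):
        super().__init__()
        self.n_steps = n_steps
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = F.normalize(x, dim=-1)                  # start on the sphere
        for _ in range(self.n_steps):               # recurrent depth: one block, many steps
            x = x + self.attn(x, x, x, need_weights=False)[0]
            x = x + self.mlp(x)
            x = F.normalize(x, dim=-1)              # retract onto the hypersphere
        return x
```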
- Towards Principled Graph Transformers [8.897857788525629]
Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer theoretically well-understood expressive power.
We show that the proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power.
arXiv Detail & Related papers (2024-01-18T16:50:55Z)
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
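As a rough point of comparison for this family of models, the sketch below implements plain top-k sparse attention, keeping only the k highest-scoring keys per query. DynaST's dynamic-attention unit goes further by varying the number of kept tokens per position, which this fixed-k toy version does not attempt:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_top=8):
    """Keep only the k_top highest-scoring keys per query; mask out the rest (illustrative)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, Tq, Tk)
    k_top = min(k_top, scores.shape[-1])
    thresh = scores.topk(k_top, dim=-1).values[..., -1:]   # k_top-th largest score per query
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 32, 64)
k = torch.randn(1, 128, 64)
v = torch.randn(1, 128, 64)
out = topk_sparse_attention(q, k, v)  # (1, 32, 64); each query mixes only 8 value rows
```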
- Transformers from an Optimization Perspective [24.78739299952529]
We study the problem of finding an energy function underlying the Transformer model.
By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process.
This work contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
arXiv Detail & Related papers (2022-05-27T10:45:15Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
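One way to picture a multi-path layer of this kind: a local path attends within non-overlapping windows while a global path attends to a pooled, coarser copy of the sequence, and the paths are summed. The sketch below flattens the idea to 1-D sequences and two granularities; the paper's multi-granularity vision design is richer:

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Illustrative two-path attention: windowed (local) plus pooled (global)."""

    def __init__(self, d_model=96, n_heads=4, window=16, pool=4):
        super().__init__()
        self.window, self.pool = window, pool
        self.local = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.glob = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                 # x: (B, T, D) with T divisible by window
        B, T, D = x.shape
        # Local path: self-attention restricted to non-overlapping windows.
        w = x.reshape(B * T // self.window, self.window, D)
        local = self.local(w, w, w, need_weights=False)[0].reshape(B, T, D)
        # Global path: full-resolution queries attend to an average-pooled sequence.
        coarse = nn.functional.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
        glob = self.glob(x, coarse, coarse, need_weights=False)[0]
        return x + local + glob           # combine both granularities residually

m = LocalGlobalAttention()
y = m(torch.randn(2, 64, 96))             # 64 tokens = 4 local windows of 16
```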