YuriiFormer: A Suite of Nesterov-Accelerated Transformers
- URL: http://arxiv.org/abs/2601.23236v1
- Date: Fri, 30 Jan 2026 18:06:21 GMT
- Title: YuriiFormer: A Suite of Nesterov-Accelerated Transformers
- Authors: Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet
- Abstract summary: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energies.
- Score: 62.40952219538543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
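The abstract names the two oracles (attention for the interaction energy, the MLP for the potential energy) but not the exact accelerated update. A minimal PyTorch sketch of the idea, assuming the standard constant-momentum Nesterov lookahead and a pre-LayerNorm GPT-style block (the class name, hyperparameters, and placement of the lookahead are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class NesterovBlockStack(nn.Module):
    """Hypothetical sketch: transformer layers as Nesterov-accelerated
    descent steps on token embeddings. Attention and the MLP act as the
    two (negative) gradient oracles; Lie-Trotter splitting alternates
    their sub-steps within each iteration."""

    def __init__(self, d_model=256, n_heads=4, depth=6, beta=0.5):
        super().__init__()
        self.beta = beta  # momentum coefficient (assumed constant here)
        self.ln1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
        self.ln2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(depth))
        self.mlp = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(depth))

    def forward(self, x):                     # x: (batch, tokens, d_model)
        x_prev = x                            # x_{k-1}: Nesterov tracks the previous iterate
        for ln1, ln2, attn, mlp in zip(self.ln1, self.ln2, self.attn, self.mlp):
            y = x + self.beta * (x - x_prev)  # lookahead: y_k = x_k + beta (x_k - x_{k-1})
            x_prev = x
            h = ln1(y)                        # attention sub-step (interaction energy)
            y = y + attn(h, h, h, need_weights=False)[0]
            x = y + mlp(ln2(y))               # MLP sub-step (potential energy)
        return x
```

Setting `beta = 0` collapses the loop to the usual residual updates `x = x + attn(...)`, `x = x + mlp(...)`, consistent with the abstract's claim that standard GPT-style transformers correspond to vanilla gradient descent under Lie-Trotter splitting.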
Related papers
- A Constrained Optimization Perspective of Unrolled Transformers [77.12297732942095]
We introduce a constrained optimization framework for training transformers that behave like iterative descent algorithms. We observe that constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization.
arXiv Detail & Related papers (2026-01-24T02:12:39Z)
- Rethinking Vision Transformer Depth via Structural Reparameterization [16.12815682992294]
We propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K.
arXiv Detail & Related papers (2025-11-24T21:28:55Z)
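The entry above consolidates parallel training-time branches into a single path. Below is a minimal sketch of the linear-branch fusion that makes such a merge exact; the paper's scheme for reducing depth inside transformer blocks is more involved, and `ParallelLinearBranches` / `merge` are hypothetical names:

```python
import torch
import torch.nn as nn

class ParallelLinearBranches(nn.Module):
    """Training-time module with two parallel linear branches (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)
        self.branch_b = nn.Linear(dim, dim)

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)

    def merge(self):
        """Fold both branches into one layer: (A x + a) + (B x + b) = (A + B) x + (a + b)."""
        fused = nn.Linear(self.branch_a.in_features, self.branch_a.out_features)
        with torch.no_grad():
            fused.weight.copy_(self.branch_a.weight + self.branch_b.weight)
            fused.bias.copy_(self.branch_a.bias + self.branch_b.bias)
        return fused

# Sanity check: the fused single-path layer reproduces the two-branch output.
m = ParallelLinearBranches(8)
x = torch.randn(2, 8)
assert torch.allclose(m(x), m.merge()(x), atol=1e-6)
```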
- Wavy Transformer [5.4806374384787695]
We propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule.
arXiv Detail & Related papers (2025-08-18T10:03:38Z)
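The summary does not give the discretization, so the following is only a guess at what a second-order, state-velocity attention update could look like, treating the attention output as a force term (the layer name, the explicit-Euler-style integrator, and the step size `dt` are all assumptions):

```python
import torch
import torch.nn as nn

class SecondOrderAttentionLayer(nn.Module):
    """Hypothetical second-order (wave-like) attention update: each token
    carries a state x and a velocity v, and attention supplies the force."""

    def __init__(self, d_model=128, n_heads=4, dt=0.1):
        super().__init__()
        self.dt = dt
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x, v):
        h = self.norm(x)
        force = self.attn(h, h, h, need_weights=False)[0]
        v = v + self.dt * force   # second order: update the velocity first,
        x = x + self.dt * v       # then move the state along the new velocity
        return x, v

layer = SecondOrderAttentionLayer()
x = torch.randn(2, 16, 128)
v = torch.zeros_like(x)           # start from rest
for _ in range(4):                # repeated application integrates the dynamics
    x, v = layer(x, v)
```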
- Plain Transformers Can be Powerful Graph Learners [64.50059165186701]
Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers have strayed far from plain Transformers. This work demonstrates that the plain Transformer architecture can be a powerful graph learner.
arXiv Detail & Related papers (2025-04-17T02:06:50Z)
- Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization [32.04194224236952]
We formalize token dynamics as a joint maximum likelihood estimation on the hypersphere. We present the Hyper-Spherical Energy Transformer (Hyper-SET), a recurrent-depth alternative to vanilla Transformers.
arXiv Detail & Related papers (2025-02-17T10:39:11Z)
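Read as energy minimization on the sphere, the recurrent-depth idea could be sketched as one weight-shared block applied repeatedly, with tokens projected back onto the unit hypersphere after every step. This is an assumption-heavy reading of the two-sentence summary, not the actual Hyper-SET layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentHypersphereBlock(nn.Module):
    """Illustrative recurrent-depth block: shared weights across steps,
    with a retraction onto the unit hypersphere after each update."""

    def __init__(self, d_model=128, n_heads=4, n_steps=8):
        super().__init__()
        self.n_steps = n_steps
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = F.normalize(x, dim=-1)                  # start on the sphere
        for _ in range(self.n_steps):               # recurrent depth: one block, many steps
            x = x + self.attn(x, x, x, need_weights=False)[0]
            x = x + self.mlp(x)
            x = F.normalize(x, dim=-1)              # retract onto the hypersphere
        return x
```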
- Towards Principled Graph Transformers [8.897857788525629]
Graph learning architectures based on the k-dimensional Weisfeiler-Leman (k-WL) hierarchy offer theoretically well-understood expressive power.
We show that the proposed Edge Transformer, a global attention model operating on node pairs instead of nodes, has at least 3-WL expressive power.
arXiv Detail & Related papers (2024-01-18T16:50:55Z)
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention-based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
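As a rough point of comparison for this family of models, the sketch below implements plain top-k sparse attention, keeping only the k highest-scoring keys per query. DynaST's dynamic-attention unit goes further by varying the number of kept tokens per position, which this fixed-k toy version does not attempt:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_top=8):
    """Keep only the k_top highest-scoring keys per query; mask out the rest (illustrative)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, Tq, Tk)
    k_top = min(k_top, scores.shape[-1])
    thresh = scores.topk(k_top, dim=-1).values[..., -1:]   # k_top-th largest score per query
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 32, 64)
k = torch.randn(1, 128, 64)
v = torch.randn(1, 128, 64)
out = topk_sparse_attention(q, k, v)  # (1, 32, 64); each query mixes only 8 value rows
```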
- Transformers from an Optimization Perspective [24.78739299952529]
We study the problem of finding an energy function underlying the Transformer model.
By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process.
This work contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
arXiv Detail & Related papers (2022-05-27T10:45:15Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
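One way to picture a multi-path layer of this kind: a local path attends within non-overlapping windows while a global path attends to a pooled, coarser copy of the sequence, and the paths are summed. The sketch below flattens the idea to 1-D sequences and two granularities; the paper's multi-granularity vision design is richer:

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Illustrative two-path attention: windowed (local) plus pooled (global)."""

    def __init__(self, d_model=96, n_heads=4, window=16, pool=4):
        super().__init__()
        self.window, self.pool = window, pool
        self.local = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.glob = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                 # x: (B, T, D) with T divisible by window
        B, T, D = x.shape
        # Local path: self-attention restricted to non-overlapping windows.
        w = x.reshape(B * T // self.window, self.window, D)
        local = self.local(w, w, w, need_weights=False)[0].reshape(B, T, D)
        # Global path: full-resolution queries attend to an average-pooled sequence.
        coarse = nn.functional.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
        glob = self.glob(x, coarse, coarse, need_weights=False)[0]
        return x + local + glob           # combine both granularities residually

m = LocalGlobalAttention()
y = m(torch.randn(2, 64, 96))             # 64 tokens = 4 local windows of 16
```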