Related papers: Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It

Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It

URL: http://arxiv.org/abs/2505.05409v1
Date: Thu, 08 May 2025 16:51:03 GMT
Title: Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It
Authors: Marvin F. da Silva, Felix Dangel, Sageev Oore,
Abstract summary: We argue that existing sharpness measures fail for transformers because they have much richer symmetries in their attention mechanism.<n>We propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold.<n>We show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
Score: 5.89889361990138
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.

Related papers

The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization [57.37943479039033]
We study how architectural inductive bias reshapes the implicit regularization induced by the edge-of-stability phenomenon in gradient descent.<n>We show that locality and weight sharing fundamentally change this picture.
arXiv Detail & Related papers (2026-03-05T04:50:51Z)
Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths.<n>Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope.<n>We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps.<n>This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z)
Improving Equivariant Networks with Probabilistic Symmetry Breaking [9.164167226137664]
Equivariant networks encode known symmetries into neural networks, often enhancing generalizations.<n>This poses an important problem, both (1) for prediction tasks on domains where self-symmetries are common, and (2) for generative models, which must break symmetries in order to reconstruct from highly symmetric latent spaces.<n>We present novel theoretical results that establish sufficient conditions for representing such distributions.
arXiv Detail & Related papers (2025-03-27T21:04:49Z)
Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding [41.601022263772535]
Dimensionality reduction aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation.<n>To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances.
arXiv Detail & Related papers (2025-03-23T10:03:22Z)
Towards Understanding Inductive Bias in Transformers: A View From Infinity [9.00214539845063]
We argue transformers tend to be biased towards more permutation symmetric functions in sequence space. We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions. We argue WikiText dataset, does indeed possess a degree of permutation symmetry.
arXiv Detail & Related papers (2024-02-07T19:00:01Z)
Learning Layer-wise Equivariances Automatically using Gradients [66.81218780702125]
Convolutions encode equivariance symmetries into neural networks leading to better generalisation performance. symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and can not be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients.
arXiv Detail & Related papers (2023-10-09T20:22:43Z)
Curve Your Attention: Mixed-Curvature Transformers for Graph Representation Learning [77.1421343649344]
We propose a generalization of Transformers towards operating entirely on the product of constant curvature spaces. We also provide a kernelized approach to non-Euclidean attention, which enables our model to run in time and memory cost linear to the number of nodes and edges.
arXiv Detail & Related papers (2023-09-08T02:44:37Z)
Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem. We characterize the implicit bias of 1-layer transformers optimized with gradient descent. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z)
The Geometry of Neural Nets' Parameter Spaces Under Reparametrization [35.5848464226014]
We study the invariance of neural nets under reparametrization from the perspective of Riemannian geometry. We discuss implications for measuring the flatness of minima, optimization, and for probability-density.
arXiv Detail & Related papers (2023-02-14T22:48:24Z)
Equivariant Mesh Attention Networks [10.517110532297021]
We present an attention-based architecture for mesh data that is provably equivariant to all transformations mentioned above. Our results confirm that our proposed architecture is equivariant, and therefore robust, to these local/global transformations.
arXiv Detail & Related papers (2022-05-21T19:53:14Z)
Boundary theories of critical matchgate tensor networks [59.433172590351234]
Key aspects of the AdS/CFT correspondence can be captured in terms of tensor network models on hyperbolic lattices. For tensors fulfilling the matchgate constraint, these have previously been shown to produce disordered boundary states. We show that these Hamiltonians exhibit multi-scale quasiperiodic symmetries captured by an analytical toy model.
arXiv Detail & Related papers (2021-10-06T18:00:03Z)
Differentiating through the Fr\'echet Mean [51.32291896926807]
Fr'echet mean is a generalization of the Euclidean mean. We show how to differentiate through the Fr'echet mean for arbitrary Riemannian manifold. This fully integrates the Fr'echet mean into the hyperbolic neural network pipeline.
arXiv Detail & Related papers (2020-02-29T19:49:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.