Related papers: Attention Is Not What You Need

URL: http://arxiv.org/abs/2512.19428v1
Date: Mon, 22 Dec 2025 14:29:18 GMT
Title: Attention Is Not What You Need
Authors: Zhang Chong
Abstract summary: We argue that standard multi-head attention is best seen as a form of tensor lifting. We propose an attention-free architecture based on Grassmann flows.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants. To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L-by-L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plücker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space. On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plücker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning.
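
Illustrative sketch (not the authors' code): step (ii) of the abstract encodes a pair of reduced token states as a two-dimensional subspace via Plücker coordinates. For two vectors u and v, these are the 2x2 minors u[i]*v[j] - u[j]*v[i] for i < j. The function names, the choice of pairing position t with position t - d, the zero padding at the start of the sequence, and the unit normalization are all assumptions made for this example. The learned linear reduction of step (i) and the gated mixing of step (iii) are omitted.

```python
from itertools import combinations
import math


def plucker_coordinates(u, v):
    """Return the 2x2 minors u[i]*v[j] - u[j]*v[i] for all i < j.

    Up to a common scale factor, these depend only on the plane spanned
    by u and v, not on the particular basis chosen for that plane.
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    return [u[i] * v[j] - u[j] * v[i] for i, j in combinations(range(len(u)), 2)]


def unit_norm(p, eps=1e-12):
    """Scale p to unit Euclidean norm (assumed normalization, for stability)."""
    norm = math.sqrt(sum(x * x for x in p))
    if norm < eps:
        # Degenerate pair (parallel or zero vectors): no plane is defined.
        return [0.0] * len(p)
    return [x / norm for x in p]


def local_pair_features(states, offsets=(1, 2)):
    """For each position t, encode the pairs (states[t - d], states[t]).

    Only earlier positions are used, so the features are causal.
    Positions without a partner at offset d get zeros (hypothetical padding).
    Work is O(L * len(offsets) * r^2) for L tokens of reduced size r,
    which is linear in L when r and the offsets are fixed.
    """
    if not states:
        return []
    r = len(states[0])
    n_coords = r * (r - 1) // 2
    features = []
    for t in range(len(states)):
        row = []
        for d in offsets:
            if t - d < 0:
                row.extend([0.0] * n_coords)
            else:
                row.extend(unit_norm(plucker_coordinates(states[t - d], states[t])))
        features.append(row)
    return features
```

In this sketch each token receives len(offsets) * r * (r - 1) / 2 geometric features, and no L-by-L matrix is ever formed.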

Related papers

Sparse Semantic Dimension as a Generalization Certificate for LLMs [53.681678236115836]
We introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. We validate this framework on GPT-2 Small and Gemma-2B, demonstrating that our bound provides non-vacuous certificates at realistic sample sizes.
arXiv Detail & Related papers (2026-02-11T21:45:18Z)
Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks [0.0]
We investigate the structure of learning dynamics in transformer models through carefully controlled arithmetic tasks. Our results suggest a unifying geometric framework for understanding transformer learning.
arXiv Detail & Related papers (2026-02-11T03:57:46Z)
Inverting Self-Organizing Maps: A Unified Activation-Based Framework [39.146761527401424]
We show that the activation pattern of a SOM can be inverted to recover the exact input under mild geometric conditions. We introduce the Manifold-Aware Unified SOM Inversion and Control (MUSIC) update rule. We validate the approach using synthetic Gaussian mixtures, MNIST, and the Faces in the Wild dataset.
arXiv Detail & Related papers (2026-01-20T11:02:54Z)
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders [34.99839291352472]
Multilayer perceptrons (MLPs) are an integral part of large language models. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse approximation.
arXiv Detail & Related papers (2025-05-27T15:55:55Z)
Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods [22.835933033524718]
We develop a methodology to measure the similarity between arbitrary parameter masks and Hessian eigenspaces via Grassmannian metrics. Our experiments reveal an overlap between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight on the structure of their eigenspace.
arXiv Detail & Related papers (2025-04-20T18:29:39Z)
TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training [91.8932638236073]
We introduce TensorGRaD, a novel method that directly addresses the memory challenges associated with large-structured weights. We show that sparseGRaD reduces total memory usage by over 50% while maintaining and sometimes even improving accuracy.
arXiv Detail & Related papers (2025-01-04T20:51:51Z)
Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors. We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
From Semantics to Hierarchy: A Hybrid Euclidean-Tangent-Hyperbolic Space Model for Temporal Knowledge Graph Reasoning [1.1372536310854844]
Temporal knowledge graph (TKG) reasoning predicts future events based on historical data. Existing Euclidean models excel at capturing semantics but struggle with hierarchy. We propose a novel hybrid geometric space approach that leverages the strengths of both Euclidean and hyperbolic models.
arXiv Detail & Related papers (2024-08-30T10:33:08Z)
Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr. We augment an SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z)
Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation for face parsing. Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection. Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z)
DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation. We propose to leverage the Transformer to model this global context with an effective attention mechanism. Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
2D+3D facial expression recognition via embedded tensor manifold regularization [16.98176664818354]
A novel approach via embedded tensor manifold regularization for 2D+3D facial expression recognition (FERETMR) is proposed. We establish the first-order optimality condition in terms of stationary points, and then design a block coordinate descent (BCD) algorithm with convergence analysis. Numerical results on the BU-3DFE and Bosphorus databases demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2022-01-29T06:11:00Z)
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go [109.88509362837475]
We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes. NeuroMorph produces a smooth interpolation and point-to-point correspondences between them. It works well for a large variety of input shapes, including non-isometric pairs from different object categories.
arXiv Detail & Related papers (2021-06-17T12:25:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.