How Smoothing is N-simplicial Attention?
- URL: http://arxiv.org/abs/2512.15600v1
- Date: Wed, 17 Dec 2025 17:10:57 GMT
- Title: How Smoothing is N-simplicial Attention?
- Authors: Alexandre Dussolle, Pietro Liò
- Abstract summary: We introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions.
- Score: 57.21791642118324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.
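To make the higher-order step concrete, below is a minimal sketch of the N = 2 case (2-simplicial attention), where the pairwise query-key dot product is replaced by a trilinear score over one query and two keys, and messages aggregate pairs of values. The projections, scaling, and value combination are illustrative assumptions, not the paper's exact formulation, and the dense score tensor is precisely the O(T^3) cost that the paper's simplex selection is meant to avoid.

```python
# A minimal sketch of 2-simplicial attention, assuming a trilinear score and
# an elementwise product of value pairs; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def two_simplicial_attention(x, wq, wk1, wk2, wv1, wv2):
    """x: (T, d). Attends each query i over all key pairs (j, k)."""
    q, k1, k2 = x @ wq, x @ wk1, x @ wk2
    v1, v2 = x @ wv1, x @ wv2
    T, d = x.shape
    # Trilinear score s[i, j, k] = sum_c q[i, c] * k1[j, c] * k2[k, c].
    scores = torch.einsum('ic,jc,kc->ijk', q, k1, k2) / d
    # Softmax jointly over all (j, k) pairs for each query i.
    attn = F.softmax(scores.reshape(T, -1), dim=-1).reshape(T, T, T)
    # The message for i aggregates elementwise products of value pairs.
    vv = torch.einsum('jc,kc->jkc', v1, v2)
    return torch.einsum('ijk,jkc->ic', attn, vv)

x = torch.randn(8, 16)
ws = [torch.randn(16, 16) / 4 for _ in range(5)]
print(two_simplicial_attention(x, *ws).shape)  # torch.Size([8, 16])
```

The dense version materializes a T x T x T score tensor, which is why restricting computation to a small subset of task-sensitive simplices matters in practice.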
Related papers
- Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention [105.11288339285154]
Efficient-LVSM is a dual-stream architecture that applies intra-view self-attention for input views and self-then-cross attention for target views. It achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed.
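As a rough illustration of the decoupled design described above, here is a block-level sketch: input-view tokens get self-attention, target-view tokens get self-attention followed by cross-attention into the input stream. Module names and sizes are assumptions, and per-view masking for the "intra-view" attention is omitted; this is not Efficient-LVSM's actual code.

```python
# A minimal sketch of a decoupled dual-stream attention block; the per-view
# masking of the input stream is omitted for brevity (an assumption).
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.input_self = nn.MultiheadAttention(d, heads, batch_first=True)
        self.target_self = nn.MultiheadAttention(d, heads, batch_first=True)
        self.target_cross = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, inp, tgt):
        inp, _ = self.input_self(inp, inp, inp)    # input-view self-attention
        tgt, _ = self.target_self(tgt, tgt, tgt)   # target self-attention
        tgt, _ = self.target_cross(tgt, inp, inp)  # then cross to input views
        return inp, tgt

blk = DualStreamBlock()
inp, tgt = torch.randn(1, 32, 64), torch.randn(1, 16, 64)
inp, tgt = blk(inp, tgt)
print(inp.shape, tgt.shape)  # torch.Size([1, 32, 64]) torch.Size([1, 16, 64])
```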
arXiv Detail & Related papers (2026-02-06T08:11:58Z) - MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations [11.032826710593632]
In Transformers, the expressive capacity of attention grows with the width N of its fast weights, but scaling those fast weights becomes expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and compression.
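To illustrate the sparse routing idea in the summary above, here is a generic top-k mixture-of-experts sketch where a router sends each token to k experts; the router, the linear "experts", and the weighted combination are generic assumptions, not MiTA's actual design.

```python
# A generic sketch of sparse top-k routing over experts; not MiTA's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d=32, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                     # x: (T, d)
        logits = self.router(x)               # (T, E)
        w, idx = logits.topk(self.k, dim=-1)  # route each token to k experts
        w = F.softmax(w, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):            # combine selected expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```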
arXiv Detail & Related papers (2026-02-01T13:21:53Z) - FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model [87.38823851271758]
We propose a new framework for Transformer-like recommendation models. FuXi-β outperforms previous state-of-the-art models and achieves significant acceleration. Our code is available in a public repository: https://github.com/USTC-StarTeam/FuXi-beta.
arXiv Detail & Related papers (2025-08-14T13:12:29Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
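The idea of processing attention scores as a feature map can be sketched directly: compute the usual score matrix, run a small depthwise convolution over it as if it were an image, and add the result back before the softmax. The kernel size and residual placement here are assumptions, not DAPE V2's exact design.

```python
# A minimal sketch of convolving the attention score map before the softmax;
# kernel size and residual placement are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvProcessedAttention(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.heads, self.hd = heads, d // heads
        # Depthwise conv over the (T, T) score "image", one channel per head.
        self.conv = nn.Conv2d(heads, heads, 3, padding=1, groups=heads)

    def forward(self, x):                                  # x: (B, T, d)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.heads, self.hd).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.hd ** 0.5  # (B, H, T, T)
        scores = scores + self.conv(scores)  # treat scores as a feature map
        attn = F.softmax(scores, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, T, -1)

print(ConvProcessedAttention()(torch.randn(2, 10, 32)).shape)
```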
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
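A hedged sketch of the pruning recipe implied above: rank attention sublayers on a calibration batch and replace the weakest with a branch that contributes nothing, leaving the MLPs intact. The ranking criterion below (smallest branch output norm) is a hypothetical stand-in for the paper's actual selection rule, and the Linear "attention" is a toy placeholder.

```python
# A hypothetical sketch of removing low-impact attention branches; the
# selection criterion and toy Block are assumptions, not the paper's method.
import torch
import torch.nn as nn

class Zero(nn.Module):                # drops the attention branch entirely
    def forward(self, x):
        return torch.zeros_like(x)

class Block(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.attn = nn.Linear(d, d)   # stand-in for a self-attention sublayer
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
    def forward(self, x):
        x = x + self.attn(x)          # attention residual branch
        return x + self.mlp(x)        # MLP residual branch

@torch.no_grad()
def prune_attention(blocks, x, n_remove):
    """Zero out the attention branches that contribute least on a batch."""
    scores = []
    for blk in blocks:
        scores.append(blk.attn(x).norm().item())  # branch contribution size
        x = blk(x)
    for i in sorted(range(len(blocks)), key=scores.__getitem__)[:n_remove]:
        blocks[i].attn = Zero()

blocks = nn.ModuleList(Block() for _ in range(6))
prune_attention(blocks, torch.randn(4, 16, 32), n_remove=2)
```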
arXiv Detail & Related papers (2024-04-08T16:40:15Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaling attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
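For intuition on linearly scaling attention in general, here is the standard kernel-feature factorization softmax(QK^T)V ≈ phi(Q)(phi(K)^T V), which never materializes the T x T matrix. The feature map used here (ELU + 1) comes from the broader linear-attention literature and is not FAST's actual factorization, so treat this strictly as an illustration of the general idea.

```python
# A generic sketch of factorized, linear-time attention; the feature map
# phi(t) = elu(t) + 1 is a common choice, not FAST's actual factorization.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):                     # each (B, T, d)
    phi = lambda t: F.elu(t) + 1                   # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum('btd,bte->bde', k, v)        # sum_j phi(k_j) v_j^T
    num = torch.einsum('btd,bde->bte', q, kv)      # numerator, O(T d^2)
    den = torch.einsum('btd,bd->bt', q, k.sum(1))  # normalizer per query
    return num / den.unsqueeze(-1)

q, k, v = (torch.randn(2, 100, 16) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 100, 16])
```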
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
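A minimal sketch of the relative-position idea: parameterize the token-mixing matrix by relative offset, so cross-token relations cost O(T) parameters rather than O(T^2). The gating form and initialization below are assumptions, not PoSGU's exact design.

```python
# A minimal sketch of relative-position-parameterized token mixing with a
# simple multiplicative gate; not PoSGU's exact formulation.
import torch
import torch.nn as nn

class PosGate(nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        # One learned scalar per relative offset in [-(T-1), T-1].
        self.rel = nn.Parameter(torch.randn(2 * seq_len - 1) * 0.02)
        idx = torch.arange(seq_len)
        self.register_buffer('offsets',
                             idx[:, None] - idx[None, :] + seq_len - 1)

    def forward(self, x):             # x: (B, T, d)
        mix = self.rel[self.offsets]  # (T, T) built from relative positions
        return torch.einsum('ts,bsd->btd', mix, x) * x  # gated token mixing

g = PosGate(16)
print(g(torch.randn(2, 16, 8)).shape)  # torch.Size([2, 16, 8])
```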
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracies in the parameter-limited setting of the ImageNet classification benchmark.
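The key-only idea can be sketched as a global context computed from key saliencies alone, with no query-key dot products; the linear saliency score below is an assumed stand-in for the paper's compute-efficient saliency gate.

```python
# A minimal sketch of key-only attention with a saliency score; the scoring
# function is an assumption, not LinGlo's exact gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyOnlyAttention(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.key = nn.Linear(d, d)
        self.val = nn.Linear(d, d)
        self.saliency = nn.Linear(d, 1)  # per-token saliency, no queries

    def forward(self, x):                # x: (B, T, d)
        w = F.softmax(self.saliency(self.key(x)), dim=1)  # (B, T, 1)
        ctx = (w * self.val(x)).sum(dim=1, keepdim=True)  # global context
        return x + ctx                   # broadcast back to every token

m = KeyOnlyAttention()
print(m(torch.randn(2, 10, 32)).shape)  # torch.Size([2, 10, 32])
```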
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - Towards Joint Intent Detection and Slot Filling via Higher-order Attention [47.78365472691051]
Intent detection (ID) and slot filling (SF) are two major tasks in spoken language understanding (SLU).
We propose a Bilinear attention block, which exploits both the contextual and channel-wise bilinear attention distributions.
We show that our approach yields improvements compared with the state-of-the-art approach.
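As a generic illustration of a bilinear attention block over two streams (token states for slot filling, an utterance vector for intent detection), here is a low-rank bilinear score; the rank, pooling, and fusion are illustrative assumptions, not the paper's exact block.

```python
# A generic low-rank bilinear attention sketch between token states and an
# utterance vector; not the paper's exact Bilinear attention block.
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    def __init__(self, d=32, rank=8):
        super().__init__()
        self.u = nn.Linear(d, rank, bias=False)
        self.v = nn.Linear(d, rank, bias=False)

    def forward(self, tokens, intent):  # (B, T, d), (B, d)
        # Bilinear score s_t = <U h_t, V c> for each token t.
        s = (self.u(tokens) * self.v(intent).unsqueeze(1)).sum(-1)  # (B, T)
        w = torch.softmax(s, dim=-1)
        return (w.unsqueeze(-1) * tokens).sum(1)  # intent-aware summary

m = BilinearAttention()
print(m(torch.randn(2, 12, 32), torch.randn(2, 32)).shape)  # (2, 32)
```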
arXiv Detail & Related papers (2021-09-18T09:50:23Z) - GraphFM: Graph Factorization Machines for Feature Interaction Modeling [27.307086868266012]
We propose a novel approach, Graph Factorization Machine (GraphFM), by naturally representing features in the graph structure. In particular, we design a mechanism to select the beneficial feature interactions and formulate them as edges between features. The proposed model integrates the interaction function of FM into the feature aggregation strategy of the Graph Neural Network (GNN).
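The FM-in-GNN idea can be sketched as message passing where each message is the elementwise product of two feature embeddings, i.e., an FM-style second-order interaction; here the interaction edges are given rather than learned, unlike GraphFM's selection mechanism.

```python
# A minimal sketch of FM-style interactions as graph message passing; the
# edge set is fixed here, whereas GraphFM learns which edges to keep.
import torch

def fm_message_passing(emb, edges):
    """emb: (N, d) feature embeddings; edges: (E, 2) selected interactions."""
    src, dst = edges[:, 0], edges[:, 1]
    messages = emb[src] * emb[dst]    # second-order FM-style interaction
    out = torch.zeros_like(emb)
    out.index_add_(0, dst, messages)  # aggregate onto destination nodes
    return emb + out                  # residual feature update

emb = torch.randn(5, 8)
edges = torch.tensor([[0, 1], [2, 1], [3, 4]])
print(fm_message_passing(emb, edges).shape)  # torch.Size([5, 8])
```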
arXiv Detail & Related papers (2021-05-25T12:10:54Z)