How Smoothing is N-simplicial Attention?
- URL: http://arxiv.org/abs/2512.15600v1
- Date: Wed, 17 Dec 2025 17:10:57 GMT
- Title: How Smoothing is N-simplicial Attention?
- Authors: Alexandre Dussolle, Pietro Liò
- Abstract summary: We introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions.
- Score: 57.21791642118324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.
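To make the higher-order step concrete, below is a minimal sketch of the N = 2 case (2-simplicial attention), where the pairwise query-key dot product is replaced by a trilinear score over one query and two keys, and messages aggregate pairs of values. The projections, scaling, and value combination are illustrative assumptions, not the paper's exact formulation, and the dense score tensor is precisely the O(T^3) cost that the paper's simplex selection is meant to avoid.

```python
# A minimal sketch of 2-simplicial attention, assuming a trilinear score and
# an elementwise product of value pairs; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def two_simplicial_attention(x, wq, wk1, wk2, wv1, wv2):
    """x: (T, d). Attends each query i over all key pairs (j, k)."""
    q, k1, k2 = x @ wq, x @ wk1, x @ wk2
    v1, v2 = x @ wv1, x @ wv2
    T, d = x.shape
    # Trilinear score s[i, j, k] = sum_c q[i, c] * k1[j, c] * k2[k, c].
    scores = torch.einsum('ic,jc,kc->ijk', q, k1, k2) / d
    # Softmax jointly over all (j, k) pairs for each query i.
    attn = F.softmax(scores.reshape(T, -1), dim=-1).reshape(T, T, T)
    # The message for i aggregates elementwise products of value pairs.
    vv = torch.einsum('jc,kc->jkc', v1, v2)
    return torch.einsum('ijk,jkc->ic', attn, vv)

x = torch.randn(8, 16)
ws = [torch.randn(16, 16) / 4 for _ in range(5)]
print(two_simplicial_attention(x, *ws).shape)  # torch.Size([8, 16])
```

The dense version materializes a T x T x T score tensor, which is why restricting computation to a small subset of task-sensitive simplices matters in practice.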
Related papers
- Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention [105.11288339285154]
Efficient-LVSM is a dual-stream architecture that applies intra-view self-attention for input views and self-then-cross attention for target views. It achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed.
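As a rough illustration of the decoupled design described above, here is a block-level sketch: input-view tokens get self-attention, target-view tokens get self-attention followed by cross-attention into the input stream. Module names and sizes are assumptions, and per-view masking for the "intra-view" attention is omitted; this is not Efficient-LVSM's actual code.

```python
# A minimal sketch of a decoupled dual-stream attention block; the per-view
# masking of the input stream is omitted for brevity (an assumption).
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.input_self = nn.MultiheadAttention(d, heads, batch_first=True)
        self.target_self = nn.MultiheadAttention(d, heads, batch_first=True)
        self.target_cross = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, inp, tgt):
        inp, _ = self.input_self(inp, inp, inp)    # input-view self-attention
        tgt, _ = self.target_self(tgt, tgt, tgt)   # target self-attention
        tgt, _ = self.target_cross(tgt, inp, inp)  # then cross to input views
        return inp, tgt

blk = DualStreamBlock()
inp, tgt = torch.randn(1, 32, 64), torch.randn(1, 16, 64)
inp, tgt = blk(inp, tgt)
print(inp.shape, tgt.shape)  # torch.Size([1, 32, 64]) torch.Size([1, 16, 64])
```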
arXiv Detail & Related papers (2026-02-06T08:11:58Z) - MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations [11.032826710593632]
In Transformers, the expressive capacity of attention grows with the width N of its fast weights, but scaling those fast weights becomes expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and compression.
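To illustrate the sparse routing idea in the summary above, here is a generic top-k mixture-of-experts sketch where a router sends each token to k experts; the router, the linear "experts", and the weighted combination are generic assumptions, not MiTA's actual design.

```python
# A generic sketch of sparse top-k routing over experts; not MiTA's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d=32, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                     # x: (T, d)
        logits = self.router(x)               # (T, E)
        w, idx = logits.topk(self.k, dim=-1)  # route each token to k experts
        w = F.softmax(w, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):            # combine selected expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```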
arXiv Detail & Related papers (2026-02-01T13:21:53Z) - FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model [87.38823851271758]
We propose a new framework for Transformer-like recommendation models. FuXi-β outperforms previous state-of-the-art models and achieves significant acceleration. Our code is available in a public repository: https://github.com/USTC-StarTeam/FuXi-beta.
arXiv Detail & Related papers (2025-08-14T13:12:29Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
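The idea of processing attention scores as a feature map can be sketched directly: compute the usual score matrix, run a small depthwise convolution over it as if it were an image, and add the result back before the softmax. The kernel size and residual placement here are assumptions, not DAPE V2's exact design.

```python
# A minimal sketch of convolving the attention score map before the softmax;
# kernel size and residual placement are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvProcessedAttention(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.qkv = nn.Linear(d, 3 * d)
        self.heads, self.hd = heads, d // heads
        # Depthwise conv over the (T, T) score "image", one channel per head.
        self.conv = nn.Conv2d(heads, heads, 3, padding=1, groups=heads)

    def forward(self, x):                                  # x: (B, T, d)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.heads, self.hd).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.hd ** 0.5  # (B, H, T, T)
        scores = scores + self.conv(scores)  # treat scores as a feature map
        attn = F.softmax(scores, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, T, -1)

print(ConvProcessedAttention()(torch.randn(2, 10, 32)).shape)
```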
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
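A hedged sketch of the pruning recipe implied above: rank attention sublayers on a calibration batch and replace the weakest with a branch that contributes nothing, leaving the MLPs intact. The ranking criterion below (smallest branch output norm) is a hypothetical stand-in for the paper's actual selection rule, and the Linear "attention" is a toy placeholder.

```python
# A hypothetical sketch of removing low-impact attention branches; the
# selection criterion and toy Block are assumptions, not the paper's method.
import torch
import torch.nn as nn

class Zero(nn.Module):                # drops the attention branch entirely
    def forward(self, x):
        return torch.zeros_like(x)

class Block(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.attn = nn.Linear(d, d)   # stand-in for a self-attention sublayer
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
    def forward(self, x):
        x = x + self.attn(x)          # attention residual branch
        return x + self.mlp(x)        # MLP residual branch

@torch.no_grad()
def prune_attention(blocks, x, n_remove):
    """Zero out the attention branches that contribute least on a batch."""
    scores = []
    for blk in blocks:
        scores.append(blk.attn(x).norm().item())  # branch contribution size
        x = blk(x)
    for i in sorted(range(len(blocks)), key=scores.__getitem__)[:n_remove]:
        blocks[i].attn = Zero()

blocks = nn.ModuleList(Block() for _ in range(6))
prune_attention(blocks, torch.randn(4, 16, 32), n_remove=2)
```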
arXiv Detail & Related papers (2024-04-08T16:40:15Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaling attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
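For intuition on linearly scaling attention in general, here is the standard kernel-feature factorization softmax(QK^T)V ≈ phi(Q)(phi(K)^T V), which never materializes the T x T matrix. The feature map used here (ELU + 1) comes from the broader linear-attention literature and is not FAST's actual factorization, so treat this strictly as an illustration of the general idea.

```python
# A generic sketch of factorized, linear-time attention; the feature map
# phi(t) = elu(t) + 1 is a common choice, not FAST's actual factorization.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):                     # each (B, T, d)
    phi = lambda t: F.elu(t) + 1                   # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum('btd,bte->bde', k, v)        # sum_j phi(k_j) v_j^T
    num = torch.einsum('btd,bde->bte', q, kv)      # numerator, O(T d^2)
    den = torch.einsum('btd,bd->bt', q, k.sum(1))  # normalizer per query
    return num / den.unsqueeze(-1)

q, k, v = (torch.randn(2, 100, 16) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 100, 16])
```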
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
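A minimal sketch of the relative-position idea: parameterize the token-mixing matrix by relative offset, so cross-token relations cost O(T) parameters rather than O(T^2). The gating form and initialization below are assumptions, not PoSGU's exact design.

```python
# A minimal sketch of relative-position-parameterized token mixing with a
# simple multiplicative gate; not PoSGU's exact formulation.
import torch
import torch.nn as nn

class PosGate(nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        # One learned scalar per relative offset in [-(T-1), T-1].
        self.rel = nn.Parameter(torch.randn(2 * seq_len - 1) * 0.02)
        idx = torch.arange(seq_len)
        self.register_buffer('offsets',
                             idx[:, None] - idx[None, :] + seq_len - 1)

    def forward(self, x):             # x: (B, T, d)
        mix = self.rel[self.offsets]  # (T, T) built from relative positions
        return torch.einsum('ts,bsd->btd', mix, x) * x  # gated token mixing

g = PosGate(16)
print(g(torch.randn(2, 16, 8)).shape)  # torch.Size([2, 16, 8])
```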
arXiv Detail & Related papers (2022-07-15T04:18:06Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracies in the parameter-limited setting of the ImageNet classification benchmark.
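The key-only idea can be sketched as a global context computed from key saliencies alone, with no query-key dot products; the linear saliency score below is an assumed stand-in for the paper's compute-efficient saliency gate.

```python
# A minimal sketch of key-only attention with a saliency score; the scoring
# function is an assumption, not LinGlo's exact gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyOnlyAttention(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.key = nn.Linear(d, d)
        self.val = nn.Linear(d, d)
        self.saliency = nn.Linear(d, 1)  # per-token saliency, no queries

    def forward(self, x):                # x: (B, T, d)
        w = F.softmax(self.saliency(self.key(x)), dim=1)  # (B, T, 1)
        ctx = (w * self.val(x)).sum(dim=1, keepdim=True)  # global context
        return x + ctx                   # broadcast back to every token

m = KeyOnlyAttention()
print(m(torch.randn(2, 10, 32)).shape)  # torch.Size([2, 10, 32])
```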
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - Towards Joint Intent Detection and Slot Filling via Higher-order Attention [47.78365472691051]
Intent detection (ID) and slot filling (SF) are two major tasks in spoken language understanding (SLU).
We propose a Bilinear attention block, which exploits both the contextual and channel-wise bilinear attention distributions.
We show that our approach yields improvements compared with the state-of-the-art approach.
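As a generic illustration of a bilinear attention block over two streams (token states for slot filling, an utterance vector for intent detection), here is a low-rank bilinear score; the rank, pooling, and fusion are illustrative assumptions, not the paper's exact block.

```python
# A generic low-rank bilinear attention sketch between token states and an
# utterance vector; not the paper's exact Bilinear attention block.
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    def __init__(self, d=32, rank=8):
        super().__init__()
        self.u = nn.Linear(d, rank, bias=False)
        self.v = nn.Linear(d, rank, bias=False)

    def forward(self, tokens, intent):  # (B, T, d), (B, d)
        # Bilinear score s_t = <U h_t, V c> for each token t.
        s = (self.u(tokens) * self.v(intent).unsqueeze(1)).sum(-1)  # (B, T)
        w = torch.softmax(s, dim=-1)
        return (w.unsqueeze(-1) * tokens).sum(1)  # intent-aware summary

m = BilinearAttention()
print(m(torch.randn(2, 12, 32), torch.randn(2, 32)).shape)  # (2, 32)
```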
arXiv Detail & Related papers (2021-09-18T09:50:23Z) - GraphFM: Graph Factorization Machines for Feature Interaction Modeling [27.307086868266012]
We propose a novel approach, Graph Factorization Machine (GraphFM), by naturally representing features in the graph structure. In particular, we design a mechanism to select the beneficial feature interactions and formulate them as edges between features. The proposed model integrates the interaction function of FM into the feature aggregation strategy of the Graph Neural Network (GNN).
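The FM-in-GNN idea can be sketched as message passing where each message is the elementwise product of two feature embeddings, i.e., an FM-style second-order interaction; here the interaction edges are given rather than learned, unlike GraphFM's selection mechanism.

```python
# A minimal sketch of FM-style interactions as graph message passing; the
# edge set is fixed here, whereas GraphFM learns which edges to keep.
import torch

def fm_message_passing(emb, edges):
    """emb: (N, d) feature embeddings; edges: (E, 2) selected interactions."""
    src, dst = edges[:, 0], edges[:, 1]
    messages = emb[src] * emb[dst]    # second-order FM-style interaction
    out = torch.zeros_like(emb)
    out.index_add_(0, dst, messages)  # aggregate onto destination nodes
    return emb + out                  # residual feature update

emb = torch.randn(5, 8)
edges = torch.tensor([[0, 1], [2, 1], [3, 4]])
print(fm_message_passing(emb, edges).shape)  # torch.Size([5, 8])
```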
arXiv Detail & Related papers (2021-05-25T12:10:54Z)