Related papers: Fast and Simplex: 2-Simplicial Attention in Triton

Fast and Simplex: 2-Simplicial Attention in Triton

URL: http://arxiv.org/abs/2507.02754v1
Date: Thu, 03 Jul 2025 16:16:34 GMT
Title: Fast and Simplex: 2-Simplicial Attention in Triton
Authors: Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil,
Abstract summary: We investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions.<n>For a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic.
Score: 30.500203127313576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that $2$-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.

Related papers

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality [29.531450446701175]
This paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models.<n>We argue that token reduction can facilitate deeper multimodal integration and alignment, maintain coherence over long inputs, and enhance training stability.<n>We outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and broader ML and scientific domains.
arXiv Detail & Related papers (2025-05-23T11:30:30Z)
Plain Transformers Can be Powerful Graph Learners [64.50059165186701]
Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers have strayed far from plain Transformers.<n>This work demonstrates that the plain Transformer architecture can be a powerful graph learner.
arXiv Detail & Related papers (2025-04-17T02:06:50Z)
Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction [29.12836710966048]
We propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens.<n>Our results call into question the conventional wisdom that pairwise similarity style attention mechanisms are critical to the success of transformer architectures.
arXiv Detail & Related papers (2024-12-23T18:59:21Z)
Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data [10.327160288730125]
Higher-Order Transformers (HOT) are designed to process data with more than two axes, i.e. higher-order tensors.<n>To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism.<n>We validate the effectiveness of HOT on two high-dimensional tasks, including multivariate time series forecasting, and 3D medical image classification.
arXiv Detail & Related papers (2024-12-04T00:10:47Z)
Towards Neural Scaling Laws for Time Series Foundation Models [63.5211738245487]
We examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data.<n>Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings.<n>We provide practical guidelines for designing and scaling larger TSFMs with enhanced model capabilities.
arXiv Detail & Related papers (2024-10-16T08:23:39Z)
Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces layers with Kolmogorov-Arnold Network (KAN) layers. We identify three key challenges: (C1) Base function, (C2) Inefficiency, and (C3) Weight. With these designs, KAT outperforms traditional-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z)
Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image. The proposed approach reduces expected encoder computational cost while maintaining performance. It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [18.673619610942197]
Modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. We propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention.
arXiv Detail & Related papers (2023-06-22T14:39:04Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models. We prove that eliminating the MASK token and considering the whole output during the loss are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.