Scaling Reasoning without Attention
- URL: http://arxiv.org/abs/2505.22425v1
- Date: Wed, 28 May 2025 14:52:15 GMT
- Title: Scaling Reasoning without Attention
- Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong
- Abstract summary: We introduce our model, an attention-free language model that addresses both issues through architectural and data-centric innovations. Our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale.
- Score: 44.42046576158219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce our model, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on LiveCodeBench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.
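To make the fixed-memory, constant-time inference claim concrete, here is a minimal sketch (ours, not the authors' code) of decoding with an SSD-style scalar-decay recurrence of the kind Mamba-2's layers use: each token updates a fixed-size state instead of appending to a key-value cache, so per-step cost and memory do not grow with context length. The dimensions, projections, and decay value are illustrative assumptions.

```python
# Minimal sketch of fixed-memory, constant-time decoding with an SSD-style
# scalar-decay recurrence (h_t = a_t * h_{t-1} + B_t x_t^T, y_t = C_t h_t).
# Not the paper's implementation; shapes and values are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 16, 8               # head dim (P) and state dim (N)

def ssd_step(h, x_t, a_t, B_t, C_t):
    """One decoding step: update the fixed-size (N, P) state and read out y_t."""
    h = a_t * h + np.outer(B_t, x_t)   # O(N * P) state update, no growing cache
    y_t = C_t @ h                      # readout for the current token
    return h, y_t

h = np.zeros((d_state, d_model))       # the entire "memory" of the sequence
for t in range(1000):                  # sequence length never changes memory use
    x_t = rng.normal(size=d_model)     # stand-in for the token's projected input
    a_t = 0.9                          # scalar decay (input-dependent in Mamba-2)
    B_t = rng.normal(size=d_state)     # input projection onto the state
    C_t = rng.normal(size=d_state)     # output projection from the state
    h, y_t = ssd_step(h, x_t, a_t, B_t, C_t)

print(h.shape, y_t.shape)              # (8, 16) (16,) regardless of t
```

By contrast, an attention-based decoder must cache keys and values for every previous token, so its per-step memory grows linearly with the context.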
Related papers
- Scaling Linear Attention with Sparse State Expansion [58.161410995744596]
The Transformer architecture struggles with long-context scenarios due to quadratic computation and linear memory growth. We first introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions.
arXiv Detail & Related papers (2025-07-22T13:27:31Z) - C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning [78.36259648527401]
C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. We show that C2-Evo consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-07-22T12:27:08Z) - Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate [0.0]
This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. We show that specialist models trained on disparate datasets can be merged into a single, more capable Mixture-of-Experts model. We introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time.
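As a rough illustration of what layer-wise constructive training means, the toy sketch below (not from the paper; sizes, data, and the objective are placeholders) freezes the already-trained stack and optimizes only the newly added layer.

```python
# Toy sketch of layer-wise "growing": freeze the existing stack (the substrate)
# and train only the newly added layer. Dimensions and the objective are
# placeholders, not the paper's setup.
import torch
import torch.nn as nn

d = 32
frozen_stack = nn.ModuleList([nn.Linear(d, d) for _ in range(3)])
for layer in frozen_stack:                       # substrate receives no gradients
    for p in layer.parameters():
        p.requires_grad = False

new_layer = nn.Linear(d, d)                      # the layer being grown
opt = torch.optim.Adam(new_layer.parameters(), lr=1e-3)

x, target = torch.randn(64, d), torch.randn(64, d)
for _ in range(10):                              # only new_layer's weights change
    with torch.no_grad():                        # forward pass through frozen layers
        h = x
        for layer in frozen_stack:
            h = torch.relu(layer(h))
    loss = ((new_layer(h) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```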
arXiv Detail & Related papers (2025-07-08T20:01:15Z) - Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments [2.1797343876622097]
State-space models (SSMs) have emerged as powerful alternatives to Transformers for sequence modeling. We propose a novel unstructured pruning framework tailored for Mamba models that achieves up to 70% parameter reduction while retaining over 95% of the original performance.
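For reference, "unstructured" here means removing individual weights rather than whole channels or heads. A generic global magnitude-pruning sketch (not the paper's Mamba-specific framework) looks like this, with the 70% sparsity target borrowed from the summary above.

```python
# Generic unstructured (weight-level) magnitude pruning: zero the lowest-
# magnitude weights globally until a target sparsity is reached.
# Illustrative only; not the paper's pruning framework.
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Return pruned copies of the weight arrays together with binary masks."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(all_mags, sparsity)       # global magnitude cutoff
    masks = [np.abs(w) > threshold for w in weights]
    pruned = [w * m for w, m in zip(weights, masks)]
    return pruned, masks

rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 256)) for _ in range(4)]  # stand-in weight matrices
pruned, masks = magnitude_prune(layers, sparsity=0.7)
kept = sum(m.sum() for m in masks) / sum(m.size for m in masks)
print(f"fraction of weights kept: {kept:.2f}")            # ~0.30
```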
arXiv Detail & Related papers (2025-05-13T07:23:08Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant. Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations. To fuse degradation awareness and contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - SegResMamba: An Efficient Architecture for 3D Medical Image Segmentation [2.979183050755201]
We propose an efficient 3D segmentation model for medical imaging called SegResMamba. Our model uses less than half the memory during training compared to other state-of-the-art (SOTA) architectures.
arXiv Detail & Related papers (2025-03-10T18:40:28Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - State Space Models are Strong Text Rerankers [33.41687512973575]
State space models (SSMs) like Mamba offer promising advantages. Despite their potential, SSMs' effectiveness at text reranking remains underexplored. Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size.
arXiv Detail & Related papers (2024-12-18T21:42:15Z) - Restore Anything Model via Efficient Degradation Adaptation [129.38475243424563]
RAM takes a unified path that leverages inherent similarities across various degradations to enable efficient and comprehensive restoration. Experiments confirm RAM's SOTA performance, reducing model complexity by approximately 82% in trainable parameters and 85% in FLOPs.
arXiv Detail & Related papers (2024-07-18T10:26:53Z) - Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality [31.985243136674146]
State-space models (SSMs) such as Mamba have been shown to match or outperform Transformers at small to medium scale.
Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster.
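The duality is easy to check numerically: the linear-time scalar-decay recurrence and the quadratic, attention-like masked-matrix form compute the same outputs. The snippet below is an illustrative verification, not code from the paper.

```python
# Numerical check of state space duality (illustrative): a scalar-decay SSM
# recurrence equals multiplication by a decay-masked, attention-like matrix.
import numpy as np

rng = np.random.default_rng(0)
T, N, P = 6, 4, 3                        # sequence length, state dim, head dim
X = rng.normal(size=(T, P))
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
a = rng.uniform(0.5, 1.0, size=T)        # per-step scalar decay

# Linear form: recurrent state update, O(T) time, fixed memory.
h = np.zeros((N, P))
Y_recurrent = np.zeros((T, P))
for t in range(T):
    h = a[t] * h + np.outer(B[t], X[t])
    Y_recurrent[t] = C[t] @ h

# Dual form: masked (semiseparable) matrix multiply, O(T^2) time.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])   # cumulative decay from step s to t
Y_dual = ((C @ B.T) * L) @ X

print(np.allclose(Y_recurrent, Y_dual))       # True
```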
arXiv Detail & Related papers (2024-05-31T17:50:01Z) - Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems [27.781785405875084]
We propose to leverage a Transformer-based architecture with attention layers to automatically capture feature interactions.
We identify two key challenges for applying the vanilla Transformer architecture to web-scale recommender systems.
arXiv Detail & Related papers (2023-11-10T05:57:57Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on three well-known downstream natural language datasets based on GPT-2 show improved performance and efficiency in increasing model capacity.
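To show the kind of factorization an MPO decomposition performs, the sketch below (our simplification, not the paper's architecture) splits one weight matrix into two tensor cores and reports the parameter reduction; the shapes and rank are arbitrary, and a random matrix is of course not accurately recoverable at low rank.

```python
# Two-core matrix product operator (tensor-train) factorization of a weight
# matrix. Illustrative shapes and rank; not the paper's MoE architecture.
import numpy as np

def mpo_decompose(W, i_dims, j_dims, rank):
    """Factor W of shape (i1*i2, j1*j2) into cores (i1, j1, r) and (r, i2, j2)."""
    (i1, i2), (j1, j2) = i_dims, j_dims
    T = W.reshape(i1, i2, j1, j2).transpose(0, 2, 1, 3).reshape(i1 * j1, i2 * j2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    core1 = (U[:, :rank] * np.sqrt(S[:rank])).reshape(i1, j1, rank)
    core2 = (np.sqrt(S[:rank])[:, None] * Vt[:rank]).reshape(rank, i2, j2)
    return core1, core2

def mpo_reconstruct(core1, core2):
    i1, j1, r = core1.shape
    _, i2, j2 = core2.shape
    T = np.einsum("abr,rcd->acbd", core1, core2)   # back to (i1, i2, j1, j2)
    return T.reshape(i1 * i2, j1 * j2)

rng = np.random.default_rng(0)
W = rng.normal(size=(32 * 32, 32 * 32))
c1, c2 = mpo_decompose(W, (32, 32), (32, 32), rank=64)
print(W.size, c1.size + c2.size)        # 1048576 vs. 131072 parameters (8x fewer)
err = np.linalg.norm(W - mpo_reconstruct(c1, c2)) / np.linalg.norm(W)
print(f"relative truncation error: {err:.2f}")   # large for a random matrix
```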
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - Discrete Auto-regressive Variational Attention Models for Text Modeling [53.38382932162732]
Variational autoencoders (VAEs) have been widely applied for text modeling.
However, they face two challenges: information underrepresentation and posterior collapse.
We propose Discrete Auto-regressive Variational Attention Model (DAVAM) to address the challenges.
arXiv Detail & Related papers (2021-06-16T06:36:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.