Scaling Reasoning without Attention
- URL: http://arxiv.org/abs/2505.22425v1
- Date: Wed, 28 May 2025 14:52:15 GMT
- Title: Scaling Reasoning without Attention
- Authors: Xueliang Zhao, Wei Wu, Lingpeng Kong
- Abstract summary: We introduce our model, an attention-free language model that addresses both issues through architectural and data-centric innovations. Our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale.
- Score: 44.42046576158219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce our model, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, our 7B model outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on LiveCodeBench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.
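To make the fixed-memory, constant-time inference claim concrete, here is a minimal sketch (ours, not the authors' code) of decoding with an SSD-style scalar-decay recurrence of the kind Mamba-2's layers use: each token updates a fixed-size state instead of appending to a key-value cache, so per-step cost and memory do not grow with context length. The dimensions, projections, and decay value are illustrative assumptions.

```python
# Minimal sketch of fixed-memory, constant-time decoding with an SSD-style
# scalar-decay recurrence (h_t = a_t * h_{t-1} + B_t x_t^T, y_t = C_t h_t).
# Not the paper's implementation; shapes and values are placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 16, 8               # head dim (P) and state dim (N)

def ssd_step(h, x_t, a_t, B_t, C_t):
    """One decoding step: update the fixed-size (N, P) state and read out y_t."""
    h = a_t * h + np.outer(B_t, x_t)   # O(N * P) state update, no growing cache
    y_t = C_t @ h                      # readout for the current token
    return h, y_t

h = np.zeros((d_state, d_model))       # the entire "memory" of the sequence
for t in range(1000):                  # sequence length never changes memory use
    x_t = rng.normal(size=d_model)     # stand-in for the token's projected input
    a_t = 0.9                          # scalar decay (input-dependent in Mamba-2)
    B_t = rng.normal(size=d_state)     # input projection onto the state
    C_t = rng.normal(size=d_state)     # output projection from the state
    h, y_t = ssd_step(h, x_t, a_t, B_t, C_t)

print(h.shape, y_t.shape)              # (8, 16) (16,) regardless of t
```

By contrast, an attention-based decoder must cache keys and values for every previous token, so its per-step memory grows linearly with the context.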
Related papers
- Scaling Linear Attention with Sparse State Expansion [58.161410995744596]
The Transformer architecture struggles with long-context scenarios due to quadratic computation and linear memory growth. We first introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions.
arXiv Detail & Related papers (2025-07-22T13:27:31Z) - C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning [78.36259648527401]
C2-Evo is an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. We show that C2-Evo consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-07-22T12:27:08Z) - Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate [0.0]
This paper explores an alternative, constructive approach to model development, built upon the foundation of non-trainable, deterministic input embeddings. We show that specialist models trained on disparate datasets can be merged into a single, more capable Mixture-of-Experts model. We introduce a layer-wise constructive training methodology, where a deep Transformer is "grown" by progressively stacking and training one layer at a time.
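As a rough illustration of what layer-wise constructive training means, the toy sketch below (not from the paper; sizes, data, and the objective are placeholders) freezes the already-trained stack and optimizes only the newly added layer.

```python
# Toy sketch of layer-wise "growing": freeze the existing stack (the substrate)
# and train only the newly added layer. Dimensions and the objective are
# placeholders, not the paper's setup.
import torch
import torch.nn as nn

d = 32
frozen_stack = nn.ModuleList([nn.Linear(d, d) for _ in range(3)])
for layer in frozen_stack:                       # substrate receives no gradients
    for p in layer.parameters():
        p.requires_grad = False

new_layer = nn.Linear(d, d)                      # the layer being grown
opt = torch.optim.Adam(new_layer.parameters(), lr=1e-3)

x, target = torch.randn(64, d), torch.randn(64, d)
for _ in range(10):                              # only new_layer's weights change
    with torch.no_grad():                        # forward pass through frozen layers
        h = x
        for layer in frozen_stack:
            h = torch.relu(layer(h))
    loss = ((new_layer(h) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```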
arXiv Detail & Related papers (2025-07-08T20:01:15Z) - Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments [2.1797343876622097]
State-space models (SSMs) have emerged as powerful alternatives to Transformers for sequence modeling. We propose a novel unstructured pruning framework tailored for Mamba models that achieves up to 70% parameter reduction while retaining over 95% of the original performance.
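For reference, "unstructured" here means removing individual weights rather than whole channels or heads. A generic global magnitude-pruning sketch (not the paper's Mamba-specific framework) looks like this, with the 70% sparsity target borrowed from the summary above.

```python
# Generic unstructured (weight-level) magnitude pruning: zero the lowest-
# magnitude weights globally until a target sparsity is reached.
# Illustrative only; not the paper's pruning framework.
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Return pruned copies of the weight arrays together with binary masks."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(all_mags, sparsity)       # global magnitude cutoff
    masks = [np.abs(w) > threshold for w in weights]
    pruned = [w * m for w, m in zip(weights, masks)]
    return pruned, masks

rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 256)) for _ in range(4)]  # stand-in weight matrices
pruned, masks = magnitude_prune(layers, sparsity=0.7)
kept = sum(m.sum() for m in masks) / sum(m.size for m in masks)
print(f"fraction of weights kept: {kept:.2f}")            # ~0.30
```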
arXiv Detail & Related papers (2025-05-13T07:23:08Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant. Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations. To fuse degradation awareness and contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - SegResMamba: An Efficient Architecture for 3D Medical Image Segmentation [2.979183050755201]
We propose an efficient 3D segmentation model for medical imaging called SegResMamba. Our model uses less than half the memory during training compared to other state-of-the-art (SOTA) architectures.
arXiv Detail & Related papers (2025-03-10T18:40:28Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - State Space Models are Strong Text Rerankers [33.41687512973575]
State space models (SSMs) like Mamba offer promising advantages. Despite their potential, SSMs' effectiveness at text reranking remains underexplored. Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size.
arXiv Detail & Related papers (2024-12-18T21:42:15Z) - Restore Anything Model via Efficient Degradation Adaptation [129.38475243424563]
RAM takes a unified path that leverages inherent similarities across various degradations to enable efficient and comprehensive restoration. Experiments confirm RAM's SOTA performance, reducing model complexity by approximately 82% in trainable parameters and 85% in FLOPs.
arXiv Detail & Related papers (2024-07-18T10:26:53Z) - Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality [31.985243136674146]
State-space models (SSMs) such as Mamba have been shown to match or outperform Transformers at small to medium scale.
Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster.
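The duality is easy to check numerically: the linear-time scalar-decay recurrence and the quadratic, attention-like masked-matrix form compute the same outputs. The snippet below is an illustrative verification, not code from the paper.

```python
# Numerical check of state space duality (illustrative): a scalar-decay SSM
# recurrence equals multiplication by a decay-masked, attention-like matrix.
import numpy as np

rng = np.random.default_rng(0)
T, N, P = 6, 4, 3                        # sequence length, state dim, head dim
X = rng.normal(size=(T, P))
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
a = rng.uniform(0.5, 1.0, size=T)        # per-step scalar decay

# Linear form: recurrent state update, O(T) time, fixed memory.
h = np.zeros((N, P))
Y_recurrent = np.zeros((T, P))
for t in range(T):
    h = a[t] * h + np.outer(B[t], X[t])
    Y_recurrent[t] = C[t] @ h

# Dual form: masked (semiseparable) matrix multiply, O(T^2) time.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])   # cumulative decay from step s to t
Y_dual = ((C @ B.T) * L) @ X

print(np.allclose(Y_recurrent, Y_dual))       # True
```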
arXiv Detail & Related papers (2024-05-31T17:50:01Z) - Hiformer: Heterogeneous Feature Interactions Learning with Transformers for Recommender Systems [27.781785405875084]
We propose to leverage a Transformer-based architecture with attention layers to automatically capture feature interactions.
We identify two key challenges for applying the vanilla Transformer architecture to web-scale recommender systems.
arXiv Detail & Related papers (2023-11-10T05:57:57Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on three well-known downstream natural language datasets based on GPT-2 show improved performance and efficiency in increasing model capacity.
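To show the kind of factorization an MPO decomposition performs, the sketch below (our simplification, not the paper's architecture) splits one weight matrix into two tensor cores and reports the parameter reduction; the shapes and rank are arbitrary, and a random matrix is of course not accurately recoverable at low rank.

```python
# Two-core matrix product operator (tensor-train) factorization of a weight
# matrix. Illustrative shapes and rank; not the paper's MoE architecture.
import numpy as np

def mpo_decompose(W, i_dims, j_dims, rank):
    """Factor W of shape (i1*i2, j1*j2) into cores (i1, j1, r) and (r, i2, j2)."""
    (i1, i2), (j1, j2) = i_dims, j_dims
    T = W.reshape(i1, i2, j1, j2).transpose(0, 2, 1, 3).reshape(i1 * j1, i2 * j2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    core1 = (U[:, :rank] * np.sqrt(S[:rank])).reshape(i1, j1, rank)
    core2 = (np.sqrt(S[:rank])[:, None] * Vt[:rank]).reshape(rank, i2, j2)
    return core1, core2

def mpo_reconstruct(core1, core2):
    i1, j1, r = core1.shape
    _, i2, j2 = core2.shape
    T = np.einsum("abr,rcd->acbd", core1, core2)   # back to (i1, i2, j1, j2)
    return T.reshape(i1 * i2, j1 * j2)

rng = np.random.default_rng(0)
W = rng.normal(size=(32 * 32, 32 * 32))
c1, c2 = mpo_decompose(W, (32, 32), (32, 32), rank=64)
print(W.size, c1.size + c2.size)        # 1048576 vs. 131072 parameters (8x fewer)
err = np.linalg.norm(W - mpo_reconstruct(c1, c2)) / np.linalg.norm(W)
print(f"relative truncation error: {err:.2f}")   # large for a random matrix
```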
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - Discrete Auto-regressive Variational Attention Models for Text Modeling [53.38382932162732]
Variational autoencoders (VAEs) have been widely applied for text modeling.
However, they face two challenges: information underrepresentation and posterior collapse.
We propose Discrete Auto-regressive Variational Attention Model (DAVAM) to address the challenges.
arXiv Detail & Related papers (2021-06-16T06:36:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.