Apriel-H1: Towards Efficient Enterprise Reasoning Models
- URL: http://arxiv.org/abs/2511.02651v1
- Date: Tue, 04 Nov 2025 15:17:43 GMT
- Title: Apriel-H1: Towards Efficient Enterprise Reasoning Models
- Authors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra, Sebastien Paquet, Srinivas Sunkara, Valérie Bécaert, Sathwik Tejaswi Madhusudhan, Torsten Scholak
- Abstract summary: The Apriel-H1 family of hybrid LLMs combines transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA.
- Score: 6.630534140883356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without significantly compromising reasoning quality.
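To make the throughput argument concrete, here is a minimal, illustrative sketch of the two mechanisms the abstract describes: a sequence mixer that carries a fixed-size recurrent state instead of a growing KV cache, and a stack builder that swaps attention mixers for SSM mixers at chosen depths. Class names, dimensions, and the simplified diagonal recurrence are assumptions for illustration, not the released Apriel-H1 architecture.

```python
# Minimal illustrative sketch (not the Apriel-H1 implementation): a fixed-state
# linear recurrent mixer standing in for a Mamba block, plus a stack builder
# that mirrors the paper's progressive MHA -> SSM substitution.
import torch
import torch.nn as nn

class LinearRecurrentMixer(nn.Module):
    """O(T) time and O(1) state per sequence: h_t = a_t * h_{t-1} + (1 - a_t) * u_t.

    The fixed-size hidden state h replaces the KV cache that attention
    must grow with context length.
    """
    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)
        self.gate_proj = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        u = self.in_proj(x)
        a = torch.sigmoid(self.gate_proj(x))  # input-dependent decay, as in selective SSMs
        h = torch.zeros(x.shape[0], u.shape[-1], device=x.device, dtype=x.dtype)
        ys = []
        for t in range(x.shape[1]):           # recurrent scan over time
            h = a[:, t] * h + (1 - a[:, t]) * u[:, t]
            ys.append(h)
        return self.out_proj(torch.stack(ys, dim=1))

def build_hybrid_stack(n_layers: int, ssm_layers: set,
                       d_model: int = 512, n_heads: int = 8) -> nn.ModuleList:
    """Replace the sequence mixer at the given indices; full blocks would also
    wrap norms, MLPs, and residuals around each mixer."""
    return nn.ModuleList(
        LinearRecurrentMixer(d_model) if i in ssm_layers
        else nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        for i in range(n_layers)
    )
```

Since h has a fixed size, per-token decode cost is constant in context length, while each retained attention layer still pays for a growing KV cache; this is why throughput rises with the fraction of layers replaced.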
Related papers
- MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
- The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts [5.10053312713569]
This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models.
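As a rough illustration of why MLA changes the hardware balance: it caches one small compressed latent per token instead of per-head keys and values. The configuration below is hypothetical; only the bookkeeping is the point.

```python
# Back-of-the-envelope KV-cache comparison for MLA (all numbers hypothetical):
# standard MHA caches K and V per head per token, while MLA caches one small
# compressed latent vector per token.
def cache_gib(seq_len, n_layers, floats_per_token, bytes_per_float=2):
    return seq_len * n_layers * floats_per_token * bytes_per_float / 2**30

n_layers, n_heads, head_dim, d_latent = 60, 128, 128, 512
seq_len = 32_768

print(f"MHA cache: {cache_gib(seq_len, n_layers, 2 * n_heads * head_dim):.1f} GiB")  # 120.0 GiB
print(f"MLA cache: {cache_gib(seq_len, n_layers, d_latent):.1f} GiB")                # 1.9 GiB
```

At this scale the attention working set shrinks by roughly two orders of magnitude, which is why a dedicated attention accelerator is a weaker value proposition than a balanced system.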
arXiv Detail & Related papers (2025-07-21T10:18:33Z)
- Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving [4.429810985618279]
Transformers are the driving force behind today's Large Language Models (LLMs), serving as the foundation for their performance and versatility. In response to the cost of serving them, the algorithm community is exploring alternative architectures, such as state space models (SSMs), linear attention, and recurrent neural networks (RNNs).
arXiv Detail & Related papers (2025-07-14T11:40:17Z)
- Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations. We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
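The summary suggests routing tokens among multiple linear projection experts inside the SSM block. Below is a minimal sketch under that reading (top-1 routing; the dense compute-then-gather is for clarity, where a real implementation would dispatch tokens sparsely). It is my interpretation of the abstract, not the RoM implementation.

```python
# Hedged sketch of routing among linear projection experts: a token-wise
# router picks one of E projection matrices, so parameter count scales with E
# while per-token FLOPs stay roughly constant under sparse dispatch.
import torch
import torch.nn as nn

class RoutedProjection(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in); top-1 expert per token
        idx = self.router(x).argmax(dim=-1)                        # (b, t)
        out = torch.stack([e(x) for e in self.experts], dim=-2)    # (b, t, E, d_out)
        gather_idx = idx[..., None, None].expand(*idx.shape, 1, out.shape[-1])
        return out.gather(-2, gather_idx).squeeze(-2)              # (b, t, d_out)
```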
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA), consistently improves performance.
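The described modification is concrete enough to sketch. The following assumes one sigmoid gate value per head, applied multiplicatively to that head's SDPA output; whether the gate is a per-head scalar or elementwise over the head dimension is not specified in the summary.

```python
# Sketch of a head-specific sigmoid gate applied after SDPA (shapes and
# parameter placement are assumptions based on the abstract).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, n_heads)  # one gate value per head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        g = torch.sigmoid(self.gate(x))                 # (b, t, n_heads)
        attn = attn * g.transpose(1, 2).unsqueeze(-1)   # gate each head's output
        return self.out(attn.transpose(1, 2).reshape(b, t, d))
```

Because the gate can drive a head's contribution to zero regardless of the softmax, it gives the model an escape hatch from attention sinks, consistent with the paper's title.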
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models [83.77063985611846]
We introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture. Experimental results show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models.
arXiv Detail & Related papers (2025-04-14T17:38:25Z)
- Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners [72.37408197157453]
Recent advancements have demonstrated that the performance of large language models (LLMs) can be significantly enhanced by scaling computational resources at test time. This raises a fundamental question: can models with lower complexity leverage their superior generation throughput to outperform similarly sized Transformers for a fixed computational budget? To address this question and overcome the lack of strong subquadratic reasoners, we distill pure and hybrid Mamba models from pretrained Transformers.
arXiv Detail & Related papers (2025-02-27T18:08:16Z)
- Tensor Product Attention Is All You Need [61.3442269053374]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly. TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
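One way to read "tensor decompositions of queries, keys, and values" is a rank-R sum of outer products per token, as sketched below for keys; the factor shapes and the 1/R scaling are assumptions, not the paper's exact parameterization.

```python
# Hedged sketch of a rank-R tensor-product factorization of keys: each token's
# (n_heads x d_head) key slab is a sum of R outer products of a head-factor and
# a dim-factor, so only the two thin factor matrices need to be cached.
import torch
import torch.nn as nn

class FactorizedKeys(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_head: int, rank: int = 2):
        super().__init__()
        self.n_heads, self.d_head, self.rank = n_heads, d_head, rank
        self.head_factor = nn.Linear(d_model, rank * n_heads)
        self.dim_factor = nn.Linear(d_model, rank * d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        a = self.head_factor(x).view(b, t, self.rank, self.n_heads)
        c = self.dim_factor(x).view(b, t, self.rank, self.d_head)
        # sum of R outer products -> (b, t, n_heads, d_head)
        return torch.einsum('btrh,btrd->bthd', a, c) / self.rank
```

Caching the two thin factors costs R * (n_heads + d_head) floats per token instead of n_heads * d_head, which is where the memory saving comes from.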
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
- LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation [37.21518386315535]
Scaling language models to handle longer contexts introduces substantial memory challenges. We propose LightTransfer, a method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers, those focusing on recent or initial tokens, and replaces their full attention with streaming attention.
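A hedged sketch of how "lazy" layers might be scored: measure how much of a layer's attention mass falls on the initial (sink) and most recent (window) tokens. The threshold, window size, and sink size below are illustrative, not the paper's exact criterion.

```python
# Illustrative lazy-layer scoring: a layer whose attention concentrates on the
# first few (sink) and most recent (window) tokens can be served with
# sink-plus-sliding-window ("streaming") attention instead of full attention.
import torch

def lazy_score(attn: torch.Tensor, n_sink: int = 4, window: int = 64) -> float:
    """attn: (n_heads, q_len, k_len) causal softmax attention weights for one layer."""
    q_len, k_len = attn.shape[-2:]
    pos = torch.arange(k_len)
    recent = pos[None, :] > (torch.arange(q_len)[:, None] - window)  # (q, k)
    keep = recent | (pos[None, :] < n_sink)
    return (attn * keep).sum(-1).mean().item()  # mean mass on sink + window

# e.g. a layer with lazy_score(...) > 0.95 would be a candidate for streaming attention.
```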
arXiv Detail & Related papers (2024-10-17T17:58:14Z)
- Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation [15.35494431928751]
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving. We introduce model-attention disaggregation to enhance the efficiency of LLM decoding. We develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster.
arXiv Detail & Related papers (2024-05-03T02:15:15Z)