Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling
- URL: http://arxiv.org/abs/2510.26912v1
- Date: Thu, 30 Oct 2025 18:19:52 GMT
- Title: Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling
- Authors: Hyunji Lee, Wenhao Yu, Hongming Zhang, Kaixin Ma, Jiyeon Kim, Dong Yu, Minjoon Seo
- Abstract summary: We analyze hybrid architectures through the lens of memory utilization and overall performance. Sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities.
- Score: 59.84975924845338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hybrid models that combine state space models (SSMs) with attention mechanisms have shown strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the architectural design choices behind these hybrid models remain insufficiently understood. In this work, we analyze hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We first examine the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals several interesting findings, including that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We also introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities. It generalizes well across different base models and outperforms architectural modifications aimed at enhancing recall. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases.
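To make the sequential versus parallel distinction concrete, the sketch below shows the two wiring choices as minimal PyTorch modules. This is an illustrative assumption, not the paper's implementation: `SimpleSSM` is a toy linear recurrence standing in for a Mamba layer, and the class names, dimensions, and post-norm residual layout are invented for the example.

```python
# Illustrative sketch only (not the paper's code): a toy SSM layer stands in
# for Mamba, and two hybrid blocks show inter-layer (sequential) vs.
# intra-layer (parallel) fusion of SSM and attention.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal linear recurrence used as a Mamba stand-in (no gating/selectivity)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        u = self.in_proj(x)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):
            state = self.decay * state + u[:, t]  # per-channel recurrence
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


class SequentialHybridBlock(nn.Module):
    """Inter-layer fusion: an SSM sub-layer followed by an attention sub-layer."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norm1(x))            # SSM first ...
        h = self.norm2(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out                        # ... then attention


class ParallelHybridBlock(nn.Module):
    """Intra-layer fusion: SSM and attention branches share the input; outputs are summed."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.ssm = SimpleSSM(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + self.ssm(h) + attn_out


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    print(SequentialHybridBlock(64)(x).shape)  # torch.Size([2, 16, 64])
    print(ParallelHybridBlock(64)(x).shape)    # torch.Size([2, 16, 64])
```

In the sequential block the attention sub-layer always reads the SSM output, whereas in the parallel block both branches see the same hidden state and their outputs are summed; this is the basic contrast between inter-layer and intra-layer fusion discussed above.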
Related papers
- Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide [15.92814573525633]
This paper offers a comprehensive review of collective operations and distributed parallel strategies. We examine hybrid parallelization designs, emphasizing communication overlap across different stages of model deployment. We highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.
arXiv Detail & Related papers (2026-02-09T19:01:13Z) - Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling [83.29209853451697]
Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs). We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory.
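As a rough illustration of the idea that hyperedges serve as memory units, here is a toy hypergraph memory in plain Python. The class, its methods, and the overlap-count retrieval below are assumptions for exposition, not HGMem's actual mechanism.

```python
# Illustrative-only sketch of a hypergraph memory: each memory unit is a
# hyperedge linking an arbitrary set of entity nodes, so querying by any
# entity surfaces the higher-order interactions it participates in.
from collections import defaultdict


class HypergraphMemory:
    def __init__(self):
        self.hyperedges = []                 # memory units: (entities, text)
        self.node_index = defaultdict(set)   # entity -> ids of hyperedges

    def add_memory(self, entities: set, content: str) -> int:
        """Insert one memory unit as a hyperedge over its entities."""
        edge_id = len(self.hyperedges)
        self.hyperedges.append((frozenset(entities), content))
        for entity in entities:
            self.node_index[entity].add(edge_id)
        return edge_id

    def recall(self, query_entities: set) -> list:
        """Return memory contents ranked by entity overlap with the query."""
        hits = defaultdict(int)
        for entity in query_entities:
            for edge_id in self.node_index[entity]:
                hits[edge_id] += 1
        ranked = sorted(hits, key=hits.get, reverse=True)
        return [self.hyperedges[i][1] for i in ranked]


if __name__ == "__main__":
    mem = HypergraphMemory()
    mem.add_memory({"alice", "bob", "paris"}, "Alice met Bob in Paris in 2019.")
    mem.add_memory({"bob", "acme"}, "Bob joined Acme in 2021.")
    print(mem.recall({"bob", "paris"})[0])  # the Paris memory ranks first
```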
arXiv Detail & Related papers (2025-12-30T03:13:10Z) - HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation [72.69742127579508]
Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models). In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge.
arXiv Detail & Related papers (2025-11-25T17:23:38Z) - Hybrid Architectures for Language Models: Systematic Analysis and Design Insights [17.46576657832284]
Large language models combining self-attention mechanisms with structured state space models like Mamba can achieve a compelling balance between modeling quality and computational efficiency. We present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion.
arXiv Detail & Related papers (2025-10-06T13:30:07Z) - Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
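A hedged sketch of what a block-wise SSM scan can look like: the input is processed in fixed-size blocks while the recurrent state is carried across block boundaries. The toy diagonal recurrence and the blocking function are illustrative assumptions, not the paper's scanning scheme.

```python
# Hedged sketch of a block-wise scan over a toy diagonal SSM: the sequence is
# split into fixed-size blocks and the recurrent state persists across blocks,
# so temporal memory extends beyond a single block.
import torch


def blockwise_ssm_scan(u: torch.Tensor, decay: torch.Tensor, block_size: int) -> torch.Tensor:
    """u: (batch, seq, d) inputs; decay: (d,) per-channel decay in (0, 1)."""
    batch, seq, d = u.shape
    state = torch.zeros(batch, d)
    outputs = []
    for start in range(0, seq, block_size):
        block = u[:, start:start + block_size]
        block_out = []
        for t in range(block.size(1)):            # scan within one block
            state = decay * state + block[:, t]   # state carries across blocks
            block_out.append(state)
        outputs.append(torch.stack(block_out, dim=1))
    return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    u = torch.randn(1, 64, 8)
    y = blockwise_ssm_scan(u, decay=torch.full((8,), 0.95), block_size=16)
    print(y.shape)  # torch.Size([1, 64, 8])
```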
arXiv Detail & Related papers (2025-05-26T16:12:41Z) - Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning [76.88243649182886]
Hybrid architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities.
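To illustrate what group-aware structured pruning can mean in code, the sketch below removes whole output-channel groups of a weight matrix by aggregate norm, so each surviving group stays intact. The group size, scoring rule, and function name are assumptions, not the Minitron-SSM procedure.

```python
# Illustrative sketch of group-aware structured pruning: output-channel groups
# are scored by aggregate weight norm and the weakest groups are dropped,
# keeping every surviving group structurally intact.
import torch


def prune_groups(weight: torch.Tensor, group_size: int, keep_ratio: float):
    """weight: (out_features, in_features). Returns the pruned weight and the
    indices of the output-channel groups that were kept."""
    n_groups = weight.size(0) // group_size
    groups = weight[: n_groups * group_size].view(n_groups, group_size, -1)
    scores = groups.flatten(1).norm(dim=1)              # one score per group
    n_keep = max(1, int(n_groups * keep_ratio))
    keep = torch.topk(scores, n_keep).indices.sort().values
    return groups[keep].reshape(-1, weight.size(1)), keep


if __name__ == "__main__":
    w = torch.randn(64, 128)
    pruned, kept = prune_groups(w, group_size=8, keep_ratio=0.5)
    print(pruned.shape, kept.tolist())  # torch.Size([32, 128]) and the kept group ids
```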
arXiv Detail & Related papers (2025-04-15T17:26:29Z) - Hymba: A Hybrid-head Architecture for Small Language Models [65.94140459055244]
Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
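A minimal sketch of learnable meta tokens prepended to the input sequence, in the spirit of Hymba's description; the number of meta tokens, the dimensions, and the module name are illustrative assumptions.

```python
# Minimal sketch (assumptions, not Hymba's code): a small set of learned
# vectors is prepended to every sequence so later layers can attend to them
# as an always-available store of critical information.
import torch
import torch.nn as nn


class MetaTokenPrefix(nn.Module):
    def __init__(self, n_meta: int, d_model: int):
        super().__init__()
        # Learned vectors shared across all inputs.
        self.meta = nn.Parameter(torch.randn(n_meta, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        prefix = self.meta.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([prefix, x], dim=1)              # (batch, n_meta + seq, d)


if __name__ == "__main__":
    embed = nn.Embedding(1000, 64)
    prefix = MetaTokenPrefix(n_meta=8, d_model=64)
    tokens = torch.randint(0, 1000, (2, 32))
    h = prefix(embed(tokens))
    print(h.shape)  # torch.Size([2, 40, 64])
```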
arXiv Detail & Related papers (2024-11-20T19:51:25Z) - AI-Empowered Hybrid MIMO Beamforming [85.48860461696417]
Hybrid multiple-input multiple-output (MIMO) systems implement part of their beamforming in analog and part in digital.
Recent years have witnessed a growing interest in using data-aided artificial intelligence (AI) tools for hybrid beamforming design.
This article reviews candidate strategies to leverage data to improve real-time hybrid beamforming design.
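As a toy illustration of the analog/digital split, the NumPy sketch below factors the effective precoder into a phase-only analog stage and an unconstrained digital stage; array sizes and the random weights are assumptions for exposition only.

```python
# Toy sketch of hybrid beamforming: the effective precoder is the product of an
# analog stage (phase-shifter entries, unit modulus) and a digital stage applied
# on the reduced set of RF chains.
import numpy as np

rng = np.random.default_rng(0)
n_antennas, n_rf_chains, n_streams = 16, 4, 2

# Analog precoder: implemented with phase shifters, so unit-modulus entries.
analog = np.exp(1j * rng.uniform(0, 2 * np.pi, (n_antennas, n_rf_chains)))
# Digital precoder: unconstrained complex weights on the RF chains.
digital = (rng.standard_normal((n_rf_chains, n_streams))
           + 1j * rng.standard_normal((n_rf_chains, n_streams)))

effective_precoder = analog @ digital          # (n_antennas, n_streams)
symbols = (rng.standard_normal((n_streams, 1))
           + 1j * rng.standard_normal((n_streams, 1)))
transmitted = effective_precoder @ symbols     # signal on each antenna
print(transmitted.shape)                       # (16, 1)
```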
arXiv Detail & Related papers (2023-03-03T06:04:20Z) - Robust Hybrid Learning With Expert Augmentation [31.911717646180886]
We introduce a hybrid data augmentation strategy termed expert augmentation.
We demonstrate that expert augmentation, which can be incorporated into existing hybrid systems, improves generalization.
We also assess the potential real-world applicability of expert augmentation on a dataset of a real double pendulum.
arXiv Detail & Related papers (2022-02-08T14:11:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.