Related papers: Retrieval-Aware Distillation for Transformer-SSM Hybrids

URL: http://arxiv.org/abs/2602.11374v1
Date: Wed, 11 Feb 2026 21:05:00 GMT
Title: Retrieval-Aware Distillation for Transformer-SSM Hybrids
Authors: Aviv Bick, Eric P. Xing, Albert Gu
Abstract summary: State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. We propose *retrieval-aware distillation*, which converts a pretrained Transformer into a hybrid student by preserving only the retrieval-critical attention heads. We show that preserving **just 2% of attention heads recovers over 95% of teacher performance on retrieval-heavy tasks**.
Score: 56.85859614817908
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-space models (SSMs) offer efficient sequence modeling but lag behind Transformers on benchmarks that require in-context retrieval. Prior work links this gap to a small set of attention heads, termed Gather-and-Aggregate (G&A), which SSMs struggle to reproduce. We propose *retrieval-aware distillation*, which converts a pretrained Transformer into a hybrid student by preserving only these retrieval-critical heads and distilling the rest into recurrent heads. We identify the essential heads via ablation on a synthetic retrieval task, producing a hybrid with sparse, non-uniform attention placement. We show that preserving **just 2% of attention heads recovers over 95% of teacher performance on retrieval-heavy tasks** (10 heads in a 1B model), requiring far fewer heads than hybrids that retain at least 25%. We further find that large recurrent states often compensate for missing retrieval: once retrieval is handled by these heads, the SSM backbone can be simplified with limited loss, even with an $8\times$ reduction in state dimension. By reducing both the attention cache and the SSM state, the resulting hybrid is $5$--$6\times$ more memory-efficient than comparable hybrids, closing the Transformer--SSM gap at a fraction of the memory cost.
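The head-selection step mentioned in the abstract (ablating heads on a synthetic retrieval task and keeping the ones that matter) can be pictured with a toy sketch. This is not the authors' code: the scoring function `toy_score`, the `CRITICAL` weights, the 8x4 head grid, and the helper `rank_heads_by_ablation` are all hypothetical stand-ins, and a real pipeline would score a model with the chosen head masked out.

```python
from typing import Callable, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def rank_heads_by_ablation(
    heads: List[Head],
    score_fn: Callable[[FrozenSet[Head]], float],
    keep: int,
) -> List[Head]:
    """Return the `keep` heads whose individual ablation hurts the score most."""
    baseline = score_fn(frozenset())  # score with no head ablated
    drops = []
    for head in heads:
        # Ablate one head at a time and record how far the score falls.
        drop = baseline - score_fn(frozenset([head]))
        drops.append((drop, head))
    # Largest drop first; ties are broken by head position for determinism.
    drops.sort(key=lambda item: (-item[0], item[1]))
    return [head for _, head in drops[:keep]]


# Hypothetical stand-in for "accuracy on a synthetic retrieval task":
# only two heads contribute, everything else is irrelevant to the score.
CRITICAL = {(2, 1): 0.5, (5, 3): 0.3}


def toy_score(ablated: FrozenSet[Head]) -> float:
    return 1.0 - sum(w for head, w in CRITICAL.items() if head in ablated)


heads = [(layer, h) for layer in range(8) for h in range(4)]
kept = rank_heads_by_ablation(heads, toy_score, keep=2)
print(kept)
```

In this toy setting the two heads with non-zero weight are the ones retained; the remaining heads would be the candidates for distillation into recurrent heads.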

Related papers

Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states). We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z)
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling [80.48332380100915]
MiniCPM-SALA is a hybrid model that integrates the high-fidelity long-context modeling of sparse attention with the global efficiency of linear attention. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens.
arXiv Detail & Related papers (2026-02-12T09:37:05Z)
Apriel-H1: Towards Efficient Enterprise Reasoning Models [6.630534140883356]
The Apriel-H1 family of hybrid LLMs combines transformer attention and SSM sequence mixers for efficient reasoning at the 15B model size. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA.
arXiv Detail & Related papers (2025-11-04T15:17:43Z)
Autoencoder-Based Hybrid Replay for Class-Incremental Learning [10.061328213032088]
In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and forgetting. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor.
arXiv Detail & Related papers (2025-05-09T09:59:12Z)
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism [15.626801223435173]
State-space models (SSMs) offer efficient alternatives to Transformers for long sequences. In this work, we examine how in-context retrieval operates in Transformer- and SSM-based language models.
arXiv Detail & Related papers (2025-04-22T16:15:19Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models [17.027512781038617]
Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients. MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning. MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD.
arXiv Detail & Related papers (2024-04-11T18:35:49Z)
Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging [142.11622043078867]
We propose a principled Degradation-Aware Unfolding Framework (DAUF) that estimates parameters from the compressed image and physical mask, and then uses these parameters to control each iteration. By plugging HST into DAUF, we establish the first Transformer-based deep unfolding method, Degradation-Aware Unfolding Half-Shuffle Transformer (DAUHST) for HSI reconstruction.
arXiv Detail & Related papers (2022-05-20T11:37:44Z)
MFAGAN: A Compression Framework for Memory-Efficient On-Device Super-Resolution GAN [27.346272886257335]
We propose a novel compression framework, Multi-scale Feature Aggregation Net based GAN (MFAGAN), for reducing the memory access cost of the generator. MFAGAN achieves up to 8.3$\times$ memory saving and 42.9$\times$ computation reduction, compared with ESRGAN.
arXiv Detail & Related papers (2021-07-27T09:04:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.