Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
- URL: http://arxiv.org/abs/2507.06607v2
- Date: Wed, 16 Jul 2025 07:00:01 GMT
- Title: Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
- Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
- Abstract summary: We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
- Score: 129.45368843861917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
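As a rough illustration of the mechanism described in the abstract, the PyTorch sketch below shows one plausible form of a GMU: an elementwise gate derived from the current layer's hidden state modulates a memory readout shared from the Samba-based self-decoder, in place of a full attention or SSM computation in the cross-decoder. The class name, SiLU gating, and projection layout are assumptions made for a self-contained example, not the paper's exact specification; the released ArchScale codebase contains the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMemoryUnit(nn.Module):
    """Illustrative GMU-style gating layer (a sketch, not the paper's exact definition).

    Assumption: a gate computed from the current cross-decoder hidden state
    modulates a memory readout shared from the Samba-based self-decoder,
    followed by an output projection.
    """

    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, hidden: torch.Tensor, shared_memory: torch.Tensor) -> torch.Tensor:
        # hidden:        (batch, seq, d_model) current cross-decoder representation
        # shared_memory: (batch, seq, d_model) readout reused from the self-decoder,
        #                so this layer avoids recomputing its own attention/SSM state
        gate = F.silu(self.gate_proj(hidden))
        return self.out_proj(gate * shared_memory)


if __name__ == "__main__":
    gmu = GatedMemoryUnit(d_model=64)
    x = torch.randn(2, 16, 64)    # cross-decoder hidden states
    mem = torch.randn(2, 16, 64)  # memory readout shared across layers
    print(gmu(x, mem).shape)      # torch.Size([2, 16, 64])
```

The intuition behind this kind of design, as described in the abstract, is that the memory readout is computed once in the self-decoder and then cheaply reused by gating in the cross-decoder, which is what allows decoding cost to drop while pre-filling remains linear.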
Related papers
- ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation [9.006936485052128]
ACM-UNet is a general-purpose segmentation framework for medical images. It incorporates pretrained CNNs and Mamba models through a lightweight adapter mechanism. It achieves state-of-the-art performance while remaining computationally efficient.
arXiv Detail & Related papers (2025-05-30T11:30:53Z)
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon. We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
- Return of the Encoder: Maximizing Parameter Efficiency for SLMs [4.246337121596753]
Encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers.
arXiv Detail & Related papers (2025-01-27T18:06:36Z)
- Efficiently Serving Large Multimodal Models Using EPD Disaggregation [24.05805398635414]
We introduce Encode-Prefill-Decode Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches.
arXiv Detail & Related papers (2024-12-25T10:11:31Z)
- Efficient Self-Supervised Video Hashing with Selective State Spaces [63.83300352372051]
Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm.
arXiv Detail & Related papers (2024-12-19T04:33:22Z)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- Mamba YOLO: A Simple Baseline for Object Detection with State Space Model [10.44725284994877]
The YOLO series has set a new benchmark for real-time object detectors. Transformer-based structures have emerged as the most powerful solution. However, the quadratic complexity of the self-attention mechanism increases the computational burden. We introduce a simple yet effective baseline approach called Mamba YOLO.
arXiv Detail & Related papers (2024-06-09T15:56:19Z)
- S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs [7.816840847892339]
Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference.
We propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping.
Our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data.
arXiv Detail & Related papers (2024-05-30T17:54:35Z)
- You Only Cache Once: Decoder-Decoder Architectures for Language Models [132.4064488592704]
We introduce a decoder-decoder architecture, YOCO, for large language models.
YOCO only caches key-value pairs once.
The overall model behaves like a decoder-only Transformer, although YOCO only caches once.
arXiv Detail & Related papers (2024-05-08T17:57:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.