Hymba: A Hybrid-head Architecture for Small Language Models
- URL: http://arxiv.org/abs/2411.13676v1
- Date: Wed, 20 Nov 2024 19:51:25 GMT
- Title: Hymba: A Hybrid-head Architecture for Small Language Models
- Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
- Abstract summary: Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
- Score: 65.94140459055244
- Abstract: We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x higher throughput.
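The abstract describes blocks that run attention heads and SSM heads in parallel, with learnable meta tokens prepended to the prompt. Below is a minimal PyTorch sketch of that idea, assuming a simple gated linear recurrence as a stand-in for the Mamba-style SSM heads and a plain concatenate-then-project fusion; the actual Hymba block, its head dimensions, and its cross-layer KV sharing and partial sliding window attention differ from this illustration.

```python
import torch
import torch.nn as nn


class HybridHeadBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_meta: int = 8):
        super().__init__()
        # Learnable meta tokens prepended to every prompt.
        self.meta_tokens = nn.Parameter(torch.randn(n_meta, d_model) * 0.02)
        # Attention path: high-resolution recall (causal masking omitted for brevity).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SSM-style path: a gated linear recurrence as a cheap stand-in for a
        # Mamba-style state space head (efficient context summarization).
        self.ssm_in = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay logits
        self.out_proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b = x.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(b, -1, -1)
        h = self.norm(torch.cat([meta, x], dim=1))   # prepend meta tokens, then norm

        # Attention heads (run in parallel with the SSM heads below).
        attn_out, _ = self.attn(h, h, h, need_weights=False)

        # SSM heads: a learned exponential-decay state, gated per time step.
        u, z = self.ssm_in(h).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                # decay in (0, 1)
        state = torch.zeros(b, u.size(-1), device=h.device, dtype=h.dtype)
        ssm_steps = []
        for t in range(h.size(1)):                   # explicit scan: clear, not fast
            state = a * state + (1 - a) * u[:, t]
            ssm_steps.append(state * torch.sigmoid(z[:, t]))
        ssm_out = torch.stack(ssm_steps, dim=1)

        # Fuse both head types, add the residual, and drop the meta-token slots.
        fused = self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))
        return (h + fused)[:, meta.size(1):]


# Usage: a (2, 32, 256) batch keeps its original sequence length.
print(HybridHeadBlock()(torch.randn(2, 32, 256)).shape)  # torch.Size([2, 32, 256])
```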
Related papers
- Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models [75.58140912100318]
We introduce an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention.
We conduct experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV.
We introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.
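As a rough illustration of the differential treatment of K and V mentioned in the summary above, the sketch below caches keys at a smaller per-head dimension than values. The sizes, the direction of the asymmetry, and the omission of causal masking are assumptions chosen for brevity; Sigma's actual DiffQKV attention differs in detail.

```python
import torch
import torch.nn as nn


class AsymmetricKVAttention(nn.Module):
    """Attention whose cached keys use a smaller per-head dim than its values."""

    def __init__(self, d_model=256, n_heads=4, k_dim=16, v_dim=64):
        super().__init__()
        self.n_heads, self.k_dim, self.v_dim = n_heads, k_dim, v_dim
        self.q_proj = nn.Linear(d_model, n_heads * k_dim)  # queries match key dim
        self.k_proj = nn.Linear(d_model, n_heads * k_dim)  # compressed keys
        self.v_proj = nn.Linear(d_model, n_heads * v_dim)  # fuller values
        self.o_proj = nn.Linear(n_heads * v_dim, d_model)

    def forward(self, x, cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.k_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.k_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.v_dim).transpose(1, 2)
        if cache is not None:                        # extend the running KV cache
            k = torch.cat([cache["k"], k], dim=2)
            v = torch.cat([cache["v"], v], dim=2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.k_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), {"k": k, "v": v}


# Cached keys take k_dim / v_dim = 1/4 of the memory that the cached values take.
layer = AsymmetricKVAttention()
y, cache = layer(torch.randn(2, 10, 256))            # prefill
y2, cache = layer(torch.randn(2, 1, 256), cache)     # one decode step reusing the cache
```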
arXiv Detail & Related papers (2025-01-23T12:58:14Z) - CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention [53.539020807256904]
We introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO).
Our tokenization scheme represents EEG signals as per-channel patches.
We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving a 2x speed improvement while requiring 6x less memory than standard self-attention.
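A minimal PyTorch sketch of the alternating-attention idea summarized above: tokens are laid out as (batch, channels, time_patches, dim), and layers alternate between attention along time within each channel and attention across channels at each time step, so no layer pays the full (channels x time_patches)^2 cost. Depths, widths, and the alternation order are illustrative assumptions, not CEReBrO's published configuration.

```python
import torch
import torch.nn as nn


class AlternatingAttention(nn.Module):
    def __init__(self, dim=128, n_heads=4, depth=4):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))

    def forward(self, x):
        # x: (batch, channels, time_patches, dim) -- one token per channel patch
        b, c, t, d = x.shape
        for i, (attn, norm) in enumerate(zip(self.attns, self.norms)):
            if i % 2 == 0:   # temporal attention: within each channel, across time
                h = x.reshape(b * c, t, d)
            else:            # spatial attention: across channels, at each time step
                h = x.transpose(1, 2).reshape(b * t, c, d)
            hn = norm(h)
            out, _ = attn(hn, hn, hn, need_weights=False)
            h = h + out      # pre-norm residual
            if i % 2 == 0:
                x = h.reshape(b, c, t, d)
            else:
                x = h.reshape(b, t, c, d).transpose(1, 2)
        return x


# Usage: 22 EEG channels, 16 time patches per channel, 128-dim tokens.
print(AlternatingAttention()(torch.randn(2, 22, 16, 128)).shape)
# torch.Size([2, 22, 16, 128])
```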
arXiv Detail & Related papers (2025-01-18T21:44:38Z) - HRSAM: Efficient Interactive Segmentation in High-Resolution Images [59.537068118473066]
Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images.
We focus on visual length extrapolation and propose a lightweight model named HRSAM.
The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions.
arXiv Detail & Related papers (2024-07-02T09:51:56Z) - Imp: Highly Capable Large Multimodal Models for Mobile Devices [19.328141787433704]
Large language models (LLMs) have shown remarkable versatility in open-world multimodal understanding.
They are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios.
In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data.
Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales.
arXiv Detail & Related papers (2024-05-20T15:23:19Z) - Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification [5.121989578393729]
Whole slide image (WSI) assessment is a challenging and crucial step in cancer diagnosis and treatment planning.
Coarse-grained labels are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL).
We propose a novel embedding-based Dual-Query MIL pipeline (DQ-MIL)
arXiv Detail & Related papers (2023-07-14T17:06:49Z) - Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z) - CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction [22.96768147978534]
We propose a tiered ranking architecture CascadER to maintain the ranking accuracy of full ensembling while improving efficiency considerably.
CascadER uses LMs to rerank the outputs of more efficient base KGEs, relying on an adaptive subset selection scheme aimed at invoking the LMs minimally while maximizing accuracy gain over the KGE.
Our empirical analyses reveal that diversity of models across modalities and preservation of individual models' confidence signals help explain the effectiveness of CascadER.
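The tiered ranking summarized above can be sketched as a simple cascade: a cheap scorer (the base KGE) ranks every candidate, and an expensive scorer (the LM) rescores only a small head of that ranking. The fixed top-k cutoff below is a placeholder for CascadER's adaptive subset selection, and the toy scorers are hypothetical stand-ins.

```python
from typing import Callable, Hashable, Sequence


def cascade_rerank(
    query: Hashable,
    candidates: Sequence[Hashable],
    cheap_score: Callable[[Hashable, Hashable], float],      # e.g. a KGE scorer
    expensive_score: Callable[[Hashable, Hashable], float],  # e.g. an LM scorer
    rerank_k: int = 10,
) -> list:
    """Rank candidates with the cheap scorer everywhere, then rerank only the
    current top-`rerank_k` with the expensive scorer."""
    base = sorted(candidates, key=lambda c: cheap_score(query, c), reverse=True)
    head, tail = base[:rerank_k], base[rerank_k:]
    head = sorted(head, key=lambda c: expensive_score(query, c), reverse=True)
    return head + tail   # reranked head, untouched tail


# Usage with toy scorers: the expensive scorer is called only rerank_k times.
ranking = cascade_rerank(
    query="(barack_obama, born_in, ?)",
    candidates=["honolulu", "chicago", "nairobi", "springfield"],
    cheap_score=lambda q, c: -len(c),                       # stand-in for a KGE score
    expensive_score=lambda q, c: float(c == "honolulu"),    # stand-in for an LM score
    rerank_k=2,
)
print(ranking)
```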
arXiv Detail & Related papers (2022-05-16T22:55:45Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
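Following the E$^3$ entry above, the sketch below shows the two ingredients being combined in generic form: a top-k-routed sparse MoE layer and a prediction-averaging ensemble of such models. It is a plain PyTorch illustration of the two model classes, not E$^3$'s actual parameter- and FLOP-efficient design.

```python
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    """A feed-forward layer whose router sends each input to top_k of n_experts."""

    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # only the routed experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


class MoEEnsemble(nn.Module):
    """Average the predictions of independently initialized sparse-MoE members."""

    def __init__(self, n_members=3, dim=64, n_classes=10):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(SparseMoE(dim), nn.Linear(dim, n_classes))
            for _ in range(n_members)
        )

    def forward(self, x):
        return torch.stack([m(x) for m in self.members]).mean(dim=0)


print(MoEEnsemble()(torch.randn(4, 64)).shape)  # torch.Size([4, 10])
```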
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.