Hymba: A Hybrid-head Architecture for Small Language Models
- URL: http://arxiv.org/abs/2411.13676v1
- Date: Wed, 20 Nov 2024 19:51:25 GMT
- Title: Hymba: A Hybrid-head Architecture for Small Language Models
- Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
- Abstract summary: Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
- Score: 65.94140459055244
- Abstract: We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x higher throughput.
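The abstract describes blocks that run attention heads and SSM heads in parallel, with learnable meta tokens prepended to the prompt. Below is a minimal PyTorch sketch of that idea, assuming a simple gated linear recurrence as a stand-in for the Mamba-style SSM heads and a plain concatenate-then-project fusion; the actual Hymba block, its head dimensions, and its cross-layer KV sharing and partial sliding window attention differ from this illustration.

```python
import torch
import torch.nn as nn


class HybridHeadBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_meta: int = 8):
        super().__init__()
        # Learnable meta tokens prepended to every prompt.
        self.meta_tokens = nn.Parameter(torch.randn(n_meta, d_model) * 0.02)
        # Attention path: high-resolution recall (causal masking omitted for brevity).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SSM-style path: a gated linear recurrence as a cheap stand-in for a
        # Mamba-style state space head (efficient context summarization).
        self.ssm_in = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay logits
        self.out_proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b = x.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(b, -1, -1)
        h = self.norm(torch.cat([meta, x], dim=1))   # prepend meta tokens, then norm

        # Attention heads (run in parallel with the SSM heads below).
        attn_out, _ = self.attn(h, h, h, need_weights=False)

        # SSM heads: a learned exponential-decay state, gated per time step.
        u, z = self.ssm_in(h).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)                # decay in (0, 1)
        state = torch.zeros(b, u.size(-1), device=h.device, dtype=h.dtype)
        ssm_steps = []
        for t in range(h.size(1)):                   # explicit scan: clear, not fast
            state = a * state + (1 - a) * u[:, t]
            ssm_steps.append(state * torch.sigmoid(z[:, t]))
        ssm_out = torch.stack(ssm_steps, dim=1)

        # Fuse both head types, add the residual, and drop the meta-token slots.
        fused = self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))
        return (h + fused)[:, meta.size(1):]


# Usage: a (2, 32, 256) batch keeps its original sequence length.
print(HybridHeadBlock()(torch.randn(2, 32, 256)).shape)  # torch.Size([2, 32, 256])
```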
Related papers
- Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models [75.58140912100318]
We introduce an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention.
We conduct experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV.
We introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.
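As a rough illustration of the differential treatment of K and V mentioned in the summary above, the sketch below caches keys at a smaller per-head dimension than values. The sizes, the direction of the asymmetry, and the omission of causal masking are assumptions chosen for brevity; Sigma's actual DiffQKV attention differs in detail.

```python
import torch
import torch.nn as nn


class AsymmetricKVAttention(nn.Module):
    """Attention whose cached keys use a smaller per-head dim than its values."""

    def __init__(self, d_model=256, n_heads=4, k_dim=16, v_dim=64):
        super().__init__()
        self.n_heads, self.k_dim, self.v_dim = n_heads, k_dim, v_dim
        self.q_proj = nn.Linear(d_model, n_heads * k_dim)  # queries match key dim
        self.k_proj = nn.Linear(d_model, n_heads * k_dim)  # compressed keys
        self.v_proj = nn.Linear(d_model, n_heads * v_dim)  # fuller values
        self.o_proj = nn.Linear(n_heads * v_dim, d_model)

    def forward(self, x, cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.k_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.k_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.v_dim).transpose(1, 2)
        if cache is not None:                        # extend the running KV cache
            k = torch.cat([cache["k"], k], dim=2)
            v = torch.cat([cache["v"], v], dim=2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.k_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), {"k": k, "v": v}


# Cached keys take k_dim / v_dim = 1/4 of the memory that the cached values take.
layer = AsymmetricKVAttention()
y, cache = layer(torch.randn(2, 10, 256))            # prefill
y2, cache = layer(torch.randn(2, 1, 256), cache)     # one decode step reusing the cache
```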
arXiv Detail & Related papers (2025-01-23T12:58:14Z) - CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention [53.539020807256904]
We introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO).
Our tokenization scheme represents EEG signals as per-channel patches.
We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving a 2x speed improvement while requiring 6x less memory than standard self-attention.
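A minimal PyTorch sketch of the alternating-attention idea summarized above: tokens are laid out as (batch, channels, time_patches, dim), and layers alternate between attention along time within each channel and attention across channels at each time step, so no layer pays the full (channels x time_patches)^2 cost. Depths, widths, and the alternation order are illustrative assumptions, not CEReBrO's published configuration.

```python
import torch
import torch.nn as nn


class AlternatingAttention(nn.Module):
    def __init__(self, dim=128, n_heads=4, depth=4):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(depth)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(depth))

    def forward(self, x):
        # x: (batch, channels, time_patches, dim) -- one token per channel patch
        b, c, t, d = x.shape
        for i, (attn, norm) in enumerate(zip(self.attns, self.norms)):
            if i % 2 == 0:   # temporal attention: within each channel, across time
                h = x.reshape(b * c, t, d)
            else:            # spatial attention: across channels, at each time step
                h = x.transpose(1, 2).reshape(b * t, c, d)
            hn = norm(h)
            out, _ = attn(hn, hn, hn, need_weights=False)
            h = h + out      # pre-norm residual
            if i % 2 == 0:
                x = h.reshape(b, c, t, d)
            else:
                x = h.reshape(b, t, c, d).transpose(1, 2)
        return x


# Usage: 22 EEG channels, 16 time patches per channel, 128-dim tokens.
print(AlternatingAttention()(torch.randn(2, 22, 16, 128)).shape)
# torch.Size([2, 22, 16, 128])
```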
arXiv Detail & Related papers (2025-01-18T21:44:38Z) - HRSAM: Efficient Interactive Segmentation in High-Resolution Images [59.537068118473066]
Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images.
We focus on visual length extrapolation and propose a lightweight model named HRSAM.
The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions.
arXiv Detail & Related papers (2024-07-02T09:51:56Z) - Imp: Highly Capable Large Multimodal Models for Mobile Devices [19.328141787433704]
Large language models (LLMs) have shown remarkable versatility in open-world multimodal understanding.
They are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios.
In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data.
Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales.
arXiv Detail & Related papers (2024-05-20T15:23:19Z) - Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification [5.121989578393729]
Whole slide image (WSI) assessment is a challenging and crucial step in cancer diagnosis and treatment planning.
Coarse-grained labels are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL).
We propose a novel embedding-based Dual-Query MIL pipeline (DQ-MIL)
arXiv Detail & Related papers (2023-07-14T17:06:49Z) - Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z) - CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction [22.96768147978534]
We propose a tiered ranking architecture CascadER to maintain the ranking accuracy of full ensembling while improving efficiency considerably.
CascadER uses LMs to rerank the outputs of more efficient base KGEs, relying on an adaptive subset selection scheme aimed at invoking the LMs minimally while maximizing accuracy gain over the KGE.
Our empirical analyses reveal that diversity of models across modalities and preservation of individual models' confidence signals help explain the effectiveness of CascadER.
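The tiered ranking summarized above can be sketched as a simple cascade: a cheap scorer (the base KGE) ranks every candidate, and an expensive scorer (the LM) rescores only a small head of that ranking. The fixed top-k cutoff below is a placeholder for CascadER's adaptive subset selection, and the toy scorers are hypothetical stand-ins.

```python
from typing import Callable, Hashable, Sequence


def cascade_rerank(
    query: Hashable,
    candidates: Sequence[Hashable],
    cheap_score: Callable[[Hashable, Hashable], float],      # e.g. a KGE scorer
    expensive_score: Callable[[Hashable, Hashable], float],  # e.g. an LM scorer
    rerank_k: int = 10,
) -> list:
    """Rank candidates with the cheap scorer everywhere, then rerank only the
    current top-`rerank_k` with the expensive scorer."""
    base = sorted(candidates, key=lambda c: cheap_score(query, c), reverse=True)
    head, tail = base[:rerank_k], base[rerank_k:]
    head = sorted(head, key=lambda c: expensive_score(query, c), reverse=True)
    return head + tail   # reranked head, untouched tail


# Usage with toy scorers: the expensive scorer is called only rerank_k times.
ranking = cascade_rerank(
    query="(barack_obama, born_in, ?)",
    candidates=["honolulu", "chicago", "nairobi", "springfield"],
    cheap_score=lambda q, c: -len(c),                       # stand-in for a KGE score
    expensive_score=lambda q, c: float(c == "honolulu"),    # stand-in for an LM score
    rerank_k=2,
)
print(ranking)
```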
arXiv Detail & Related papers (2022-05-16T22:55:45Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
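Following the E$^3$ entry above, the sketch below shows the two ingredients being combined in generic form: a top-k-routed sparse MoE layer and a prediction-averaging ensemble of such models. It is a plain PyTorch illustration of the two model classes, not E$^3$'s actual parameter- and FLOP-efficient design.

```python
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    """A feed-forward layer whose router sends each input to top_k of n_experts."""

    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # only the routed experts run
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


class MoEEnsemble(nn.Module):
    """Average the predictions of independently initialized sparse-MoE members."""

    def __init__(self, n_members=3, dim=64, n_classes=10):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(SparseMoE(dim), nn.Linear(dim, n_classes))
            for _ in range(n_members)
        )

    def forward(self, x):
        return torch.stack([m(x) for m in self.members]).mean(dim=0)


print(MoEEnsemble()(torch.randn(4, 64)).shape)  # torch.Size([4, 10])
```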
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.