Hymba: A Hybrid-head Architecture for Small Language Models
- URL: http://arxiv.org/abs/2411.13676v1
- Date: Wed, 20 Nov 2024 19:51:25 GMT
- Title: Hymba: A Hybrid-head Architecture for Small Language Models
- Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov,
- Abstract summary: Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
- Score: 65.94140459055244
- License:
- Abstract: We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
Related papers
- GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model [66.35608254724566]
State-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity.
However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks.
Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes.
arXiv Detail & Related papers (2024-07-18T17:59:58Z) - HRSAM: Efficient Interactive Segmentation in High-Resolution Images [59.537068118473066]
Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images.
We focus on visual length extrapolation and propose a lightweight model named HRSAM.
The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions.
arXiv Detail & Related papers (2024-07-02T09:51:56Z) - Imp: Highly Capable Large Multimodal Models for Mobile Devices [19.328141787433704]
Large language models (LLMs) have shown remarkable versatility in open-world multimodal understanding.
They are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios.
In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data.
Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales.
arXiv Detail & Related papers (2024-05-20T15:23:19Z) - Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales such architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
arXiv Detail & Related papers (2024-05-06T03:06:33Z) - DiTMoS: Delving into Diverse Tiny-Model Selection on Microcontrollers [34.282971510732736]
We introduce DiTMoS, a novel DNN training and inference framework with a selector-classifiers architecture.
A composition of weak models can exhibit high diversity and the union of them can significantly boost the accuracy upper bound.
We deploy DiTMoS on the Neucleo STM32F767ZI board and evaluate it based on three time-series datasets for human activity recognition, keywords spotting, and emotion recognition.
arXiv Detail & Related papers (2024-03-14T02:11:38Z) - Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based
Tumor Classification [5.121989578393729]
Whole slide image (WSI) assessment is a challenging and crucial step in cancer diagnosis and treatment planning.
Coarse-grained labels are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL)
We propose a novel embedding-based Dual-Query MIL pipeline (DQ-MIL)
arXiv Detail & Related papers (2023-07-14T17:06:49Z) - Faster Attention Is What You Need: A Fast Self-Attention Neural Network
Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z) - CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction [22.96768147978534]
We propose a tiered ranking architecture CascadER to maintain the ranking accuracy of full ensembling while improving efficiency considerably.
CascadER uses LMs to rerank the outputs of more efficient base KGEs, relying on an adaptive subset selection scheme aimed at invoking the LMs minimally while maximizing accuracy gain over the KGE.
Our empirical analyses reveal that diversity of models across modalities and preservation of individual models' confidence signals help explain the effectiveness of CascadER.
arXiv Detail & Related papers (2022-05-16T22:55:45Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs)
We present Efficient Ensemble of Experts (E$3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.