Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
- URL: http://arxiv.org/abs/2501.13629v2
- Date: Mon, 10 Feb 2025 17:19:21 GMT
- Title: Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
- Authors: Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang
- Abstract summary: We introduce an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention.
We conduct experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV.
We introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.
- Score: 75.58140912100318
- Abstract: We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of the K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q, which expands the Q head dimension to enhance the model's representation capacity with minimal impact on inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B tokens of carefully collected system domain data and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves performance comparable to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark, AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.
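The abstract describes the mechanism concretely enough for a rough sketch: K and V are compressed by different amounts (here modeled with different head counts, like grouped-query attention but with K and V decoupled), while Q is widened. The PyTorch-style sketch below is a minimal illustration under assumed head counts and dimensions, not the paper's implementation; the augmented-Q widening is only noted in a comment, since its exact wiring is not given in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffKVAttention(nn.Module):
    """Illustrative sketch of differentially compressed KV attention.

    K and V use different (small) numbers of heads, so the K cache can be
    compressed more aggressively than the V cache. All sizes are assumptions
    for illustration, not the paper's configuration. Augmented Q (a wider Q
    head dimension) is not shown. Causal masking is omitted for brevity.
    """

    def __init__(self, d_model=512, n_q_heads=8, n_k_heads=2, n_v_heads=4, head_dim=64):
        super().__init__()
        assert n_q_heads % n_k_heads == 0 and n_q_heads % n_v_heads == 0
        self.n_q, self.n_k, self.n_v, self.hd = n_q_heads, n_k_heads, n_v_heads, head_dim
        self.wq = nn.Linear(d_model, n_q_heads * head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_k_heads * head_dim, bias=False)  # fewest heads: smallest cache
        self.wv = nn.Linear(d_model, n_v_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_k, self.hd).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_v, self.hd).transpose(1, 2)
        # Broadcast the compressed K and V heads to the query heads,
        # with *different* replication factors for K and V.
        k = k.repeat_interleave(self.n_q // self.n_k, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_v, dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.hd ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, self.n_q * self.hd)
        return self.wo(out)

x = torch.randn(2, 16, 512)
y = DiffKVAttention()(x)  # (2, 16, 512)
```

Giving K fewer heads than V reflects the abstract's point that the two components tolerate different amounts of compression; the specific 8/2/4 split above is purely an assumed example.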
Related papers
- Hymba: A Hybrid-head Architecture for Small Language Models [65.94140459055244]
Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
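Cross-layer key-value sharing lets adjacent layers reuse one KV projection and cache. The toy sketch below illustrates one plausible scheme (even-indexed layers own the KV, odd-indexed layers reuse it); the class name, pairing, and sizes are assumptions, not Hymba's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVLayer(nn.Module):
    """Toy attention layer for illustrating cross-layer KV sharing:
    layers that own KV project fresh K/V; the others reuse the K/V
    passed in from the layer below, roughly halving KV-cache memory."""

    def __init__(self, d=64, owns_kv=True):
        super().__init__()
        self.owns_kv = owns_kv
        self.wq = nn.Linear(d, d, bias=False)
        if owns_kv:
            self.wk = nn.Linear(d, d, bias=False)
            self.wv = nn.Linear(d, d, bias=False)

    def forward(self, x, shared_kv=None):
        if self.owns_kv:
            shared_kv = (self.wk(x), self.wv(x))  # computed once, reused above
        k, v = shared_kv
        q = self.wq(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v, shared_kv

layers = [SharedKVLayer(owns_kv=(i % 2 == 0)) for i in range(4)]
x, kv = torch.randn(1, 10, 64), None
for layer in layers:
    x, kv = layer(x, kv)  # odd layers skip their own KV projections
```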
arXiv Detail & Related papers (2024-11-20T19:51:25Z)
- Don't Just Pay Attention, PLANT It: Transfer L2R Models to Fine-tune Attention in Extreme Multi-Label Text Classification [1.6385815610837162]
We introduce PLANT -- Pretrained and Leveraged AtteNTion -- a novel transfer learning strategy for fine-tuning XMTC decoders.
PLANT surpasses existing state-of-the-art methods across all metrics on mimicfull, mimicfifty, mimicfour, eurlex, and wikiten datasets.
It particularly excels in few-shot scenarios, outperforming previous models specifically designed for few-shot scenarios by over 50 percentage points in F1 scores on mimicrare and by over 36 percentage points on mimicfew.
arXiv Detail & Related papers (2024-10-30T14:41:23Z)
- Gamified crowd-sourcing of high-quality data for visual fine-tuning [0.9487395978583629]
This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models.
GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge.
arXiv Detail & Related papers (2024-10-05T05:10:29Z)
- Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders [6.7181844004432385]
The Inter-Intra Modal Measure (IIMM) functions as a strong predictor of performance changes with fine-tuning.
Fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation.
With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning.
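The summary does not spell out the measure's formula, so the sketch below is only an illustration of the kind of single-forward-pass embedding statistic the name suggests: a mean inter-modal similarity between paired image/text embeddings and a mean intra-modal similarity within one modality. This is a hypothetical stand-in, not the paper's definition of IIMM.

```python
import numpy as np

def inter_intra_stats(img_emb, txt_emb):
    """Hypothetical illustration (NOT the paper's IIMM definition):
    cosine similarity between paired image/text embeddings (inter-modal)
    and mean pairwise similarity among image embeddings (intra-modal),
    both computable from a single forward pass over the target data."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    inter = float((img * txt).sum(axis=1).mean())
    sims = img @ img.T
    n = len(img)
    intra = float((sims.sum() - np.trace(sims)) / (n * (n - 1)))
    return inter, intra

rng = np.random.default_rng(0)
print(inter_intra_stats(rng.normal(size=(32, 128)), rng.normal(size=(32, 128))))
```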
arXiv Detail & Related papers (2024-07-22T15:35:09Z)
- Interpreting and Improving Attention From the Perspective of Large Kernel Convolution [51.06461246235176]
We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution.
LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
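As a rough sketch of the attention-as-convolution idea (an assumed form, not the paper's exact LKCA), a single depthwise convolution with a large kernel can play the role of the attention map over a 2-D grid of patch tokens; the class name and kernel size below are assumptions.

```python
import torch
import torch.nn as nn

class LargeKernelMixer(nn.Module):
    """Sketch: one depthwise large-kernel convolution mixes spatial
    information per channel (standing in for the attention map),
    followed by a pointwise convolution that mixes channels."""

    def __init__(self, dim, kernel_size=13):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W) grid of patch embeddings
        return self.pw(self.dw(x))

tokens = torch.randn(1, 192, 14, 14)  # e.g., a 14x14 patch grid
mixed = LargeKernelMixer(192)(tokens)  # same shape out
```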
arXiv Detail & Related papers (2024-01-11T08:40:35Z)
- Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
- Towards quantitative precision for ECG analysis: Leveraging state space models, self-supervision and patient metadata [2.0777058026628583]
We investigate three elements aimed at improving the quantitative accuracy of automatic ECG analysis systems.
First, we exploit structured state space models (SSMs) to capture long-term dependencies in time series data.
Secondly, we demonstrate that self-supervised learning using contrastive predictive coding can further improve the performance of SSMs.
Finally, we incorporate basic demographic metadata alongside the ECG signal as input.
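Structured state space models capture long-range dependencies through a linear state recurrence whose per-state decay factors sit close to the unit circle, so past inputs fade slowly. The following is a minimal diagonal-SSM scan as a sketch of that general mechanism, not the paper's S4 configuration; all parameter choices are assumptions.

```python
import numpy as np

def diagonal_ssm(u, a, b, c):
    """Minimal diagonal state space model scan (illustrative):
    x_k = a * x_{k-1} + b * u_k,   y_k = Re(c . x_k),
    where a, b, c are per-state complex parameters and |a| < 1
    controls how slowly past inputs decay (the long-memory knob)."""
    x = np.zeros_like(a)
    ys = []
    for u_k in u:  # u: (T,) scalar input sequence
        x = a * x + b * u_k
        ys.append(np.real(np.dot(c, x)))
    return np.array(ys)

# Example: 16 slowly decaying complex states on a toy periodic signal.
rng = np.random.default_rng(0)
a = 0.99 * np.exp(1j * rng.uniform(0, np.pi, 16))
b = rng.normal(size=16).astype(complex)
c = rng.normal(size=16).astype(complex)
y = diagonal_ssm(np.sin(np.linspace(0, 8 * np.pi, 200)), a, b, c)
```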
arXiv Detail & Related papers (2023-08-29T13:25:26Z)
- How Knowledge Graph and Attention Help? A Quantitative Analysis into Bag-level Relation Extraction [66.09605613944201]
We quantitatively evaluate the effect of attention and knowledge graphs (KG) on bag-level relation extraction (RE).
We find that (1) higher attention accuracy may lead to worse performance, as it can harm the model's ability to extract entity mention features; (2) the performance of attention is largely influenced by various noise distribution patterns; and (3) KG-enhanced attention indeed improves RE performance, though not through enhanced attention but by incorporating entity priors.
arXiv Detail & Related papers (2021-07-26T09:38:28Z)
- Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
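The Cauchy-Schwarz divergence between densities p and q is D_CS(p, q) = -log( ∫pq / sqrt(∫p² ∫q²) ), and for Gaussians each integral has a closed form (the inner product of two Gaussian pdfs is itself a Gaussian pdf evaluated at the difference of the means), which is what makes the objective analytic for GMMs. Below is a minimal sketch of the single-Gaussian 1-D case under that standard definition; the function name is ours, and the paper's GMM extension is not shown.

```python
import numpy as np
from scipy.stats import norm

def cs_divergence_gauss(m1, s1, m2, s2):
    """Closed-form Cauchy-Schwarz divergence between two 1-D Gaussians,
    D_CS(p, q) = -log( <p, q> / sqrt(<p, p> <q, q>) ), using
    ∫ N(x; m1, s1²) N(x; m2, s2²) dx = N(m1; m2, s1² + s2²)."""
    pq = norm.pdf(m1, loc=m2, scale=np.sqrt(s1**2 + s2**2))
    pp = norm.pdf(0.0, loc=0.0, scale=np.sqrt(2) * s1)
    qq = norm.pdf(0.0, loc=0.0, scale=np.sqrt(2) * s2)
    return -np.log(pq / np.sqrt(pp * qq))

print(cs_divergence_gauss(0.0, 1.0, 0.0, 1.0))  # identical Gaussians -> 0.0
print(cs_divergence_gauss(0.0, 1.0, 3.0, 1.0))  # separated means -> positive
```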
arXiv Detail & Related papers (2021-01-06T17:36:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.