UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
- URL: http://arxiv.org/abs/2502.00439v1
- Date: Sat, 01 Feb 2025 14:16:31 GMT
- Title: UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
- Authors: Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding
- Abstract summary: Post-training is essential for adapting Large Language Models (LLMs) to real-world applications.
We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce inference costs.
- Abstract: Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the \texttt{Softmax} operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax \textbf{Uni}fication in \textbf{Att}e\textbf{n}tion (\textbf{UniAttn}), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at \url{https://github.com/Bostoncake/UniAttn}.
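The abstract names two moving parts: unifying (reusing) Softmax activations across transformer blocks, and a linear projection that compensates for the resulting error. The sketch below illustrates one plausible reading in PyTorch: an "anchor" attention layer computes the softmax attention map once, later layers in its group reuse that map on their own values, and a learned linear compensation term is added to the output. The grouping scheme, which components are still computed per layer, and where the compensation projection sits are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of Softmax unification across attention layers (assumptions:
# only the "anchor" layer computes softmax(QK^T/sqrt(d)); follower layers reuse
# its attention map on their own V and add a learned linear compensation term).
import math
import torch
import torch.nn as nn


class SharedSoftmaxAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, is_anchor: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.is_anchor = is_anchor
        if is_anchor:  # only the anchor layer needs Q and K projections
            self.q_proj = nn.Linear(d_model, d_model, bias=False)
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # linear projection compensating for the error from reusing the softmax map
        self.compensate = None if is_anchor else nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_probs=None):
        B, T, _ = x.shape
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.is_anchor:
            q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            probs = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        else:
            probs = shared_probs  # reuse the anchor layer's softmax output
        out = (probs @ v).transpose(1, 2).reshape(B, T, -1)
        out = self.o_proj(out)
        if self.compensate is not None:
            out = out + self.compensate(x)  # compensation path, trained post-hoc
        return out, probs


# Usage: the anchor computes the attention map once; the follower reuses it.
x = torch.randn(2, 16, 64)
anchor = SharedSoftmaxAttention(64, 4, is_anchor=True)
follower = SharedSoftmaxAttention(64, 4, is_anchor=False)
y, probs = anchor(x)
z, _ = follower(y, shared_probs=probs)
```

In this reading, follower layers skip Q, K, and the Softmax entirely, which is where the inference savings described in the abstract would come from; the compensation projection is the only extra parameter cost.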
Related papers
- CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [17.807249890437767]
We introduce CoLA and its memory-efficient implementation, CoLA-M.
We leverage the low-rank structure observed widely in model activations to reduce model size, boost model capacity and training efficiency.
Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$ and improves training throughput by $1.86\times$ while maintaining full-rank level performance.
arXiv Detail & Related papers (2025-02-16T01:05:16Z) - EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models [25.058673320372677]
Large language models (LLMs), driven by scaling laws, exhibit emergent intelligence at large model sizes.
This work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models.
We reveal that it achieves top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary.
arXiv Detail & Related papers (2025-02-10T16:51:03Z) - Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotary position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z) - Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one-shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - An Efficient Inference Framework for Early-exit Large Language Models [5.048467183620882]
Early-exit models improve the inference efficiency of LLMs by skipping the remaining layers and directly generating output tokens once the model is confident enough (see the sketch after this list).
However, no existing LLM inference framework takes early-exit models into consideration.
We solve two key challenges in building an efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management.
arXiv Detail & Related papers (2024-07-25T07:50:17Z) - Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism [70.07661254213181]
We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs).
Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting.
Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead.
arXiv Detail & Related papers (2023-12-08T09:31:50Z)
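As referenced in the early-exit entry above, here is a minimal sketch of confidence-based early exiting in PyTorch: each layer's hidden state is fed through a shared exit head, and computation stops once the top token probability clears a threshold. The shared exit head, the fixed threshold, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the framework's actual design (which concerns batch scheduling and KV cache management around such models).

```python
# Minimal sketch of confidence-based early exiting (assumptions: a single
# shared exit classifier head and a fixed softmax-confidence threshold).
import torch
import torch.nn as nn


class EarlyExitStack(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_layers: int, threshold: float = 0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.exit_head = nn.Linear(d_model, vocab_size)  # shared exit classifier
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            logits = self.exit_head(x[:, -1])  # next-token logits from this depth
            conf = torch.softmax(logits, dim=-1).max(dim=-1).values
            if bool((conf >= self.threshold).all()):  # confident: skip remaining layers
                return logits, i + 1
        return logits, len(self.layers)


# Usage with toy hidden states; `used` reports how many layers actually ran.
tokens = torch.randn(1, 8, 64)
logits, used = EarlyExitStack(64, 1000, n_layers=6)(tokens)
print(f"exited after {used} layers")
```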