UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
- URL: http://arxiv.org/abs/2502.00439v1
- Date: Sat, 01 Feb 2025 14:16:31 GMT
- Title: UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
- Authors: Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding
- Abstract summary: Post-training is essential for adapting Large Language Models (LLMs) to real-world applications.
We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce inference costs.
- Abstract: Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the \texttt{Softmax} operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax \textbf{Uni}fication in \textbf{Att}e\textbf{n}tion (\textbf{UniAttn}), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at \url{https://github.com/Bostoncake/UniAttn}.
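The abstract names two moving parts: unifying (reusing) Softmax activations across transformer blocks, and a linear projection that compensates for the resulting error. The sketch below illustrates one plausible reading in PyTorch: an "anchor" attention layer computes the softmax attention map once, later layers in its group reuse that map on their own values, and a learned linear compensation term is added to the output. The grouping scheme, which components are still computed per layer, and where the compensation projection sits are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of Softmax unification across attention layers (assumptions:
# only the "anchor" layer computes softmax(QK^T/sqrt(d)); follower layers reuse
# its attention map on their own V and add a learned linear compensation term).
import math
import torch
import torch.nn as nn


class SharedSoftmaxAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, is_anchor: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.is_anchor = is_anchor
        if is_anchor:  # only the anchor layer needs Q and K projections
            self.q_proj = nn.Linear(d_model, d_model, bias=False)
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # linear projection compensating for the error from reusing the softmax map
        self.compensate = None if is_anchor else nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, shared_probs=None):
        B, T, _ = x.shape
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.is_anchor:
            q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            probs = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        else:
            probs = shared_probs  # reuse the anchor layer's softmax output
        out = (probs @ v).transpose(1, 2).reshape(B, T, -1)
        out = self.o_proj(out)
        if self.compensate is not None:
            out = out + self.compensate(x)  # compensation path, trained post-hoc
        return out, probs


# Usage: the anchor computes the attention map once; the follower reuses it.
x = torch.randn(2, 16, 64)
anchor = SharedSoftmaxAttention(64, 4, is_anchor=True)
follower = SharedSoftmaxAttention(64, 4, is_anchor=False)
y, probs = anchor(x)
z, _ = follower(y, shared_probs=probs)
```

In this reading, follower layers skip Q, K, and the Softmax entirely, which is where the inference savings described in the abstract would come from; the compensation projection is the only extra parameter cost.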
Related papers
- CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [17.807249890437767]
We introduce CoLA and its memory-efficient implementation, CoLA-M.
We leverage the low-rank structure observed widely in model activations to reduce model size, boost model capacity and training efficiency.
Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$ and improves training throughput by $1.86\times$ while maintaining full-rank level performance.
arXiv Detail & Related papers (2025-02-16T01:05:16Z) - EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models [25.058673320372677]
Large language models (LLMs), driven by scaling laws, exhibit emergent intelligence at large model sizes.
This work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models.
We reveal that it achieves top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary.
arXiv Detail & Related papers (2025-02-10T16:51:03Z) - Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotary position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z) - Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one-shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - An Efficient Inference Framework for Early-exit Large Language Models [5.048467183620882]
Early-exit models improve the inference efficiency of LLMs by skipping the remaining layers and directly generating output tokens once the model is confident enough (see the sketch after this list).
However, no existing LLM inference framework takes early-exit models into consideration.
We solve two key challenges in building an efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management.
arXiv Detail & Related papers (2024-07-25T07:50:17Z) - Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism [70.07661254213181]
We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs).
Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting.
Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead.
arXiv Detail & Related papers (2023-12-08T09:31:50Z)
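As referenced in the early-exit entry above, here is a minimal sketch of confidence-based early exiting in PyTorch: each layer's hidden state is fed through a shared exit head, and computation stops once the top token probability clears a threshold. The shared exit head, the fixed threshold, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the framework's actual design (which concerns batch scheduling and KV cache management around such models).

```python
# Minimal sketch of confidence-based early exiting (assumptions: a single
# shared exit classifier head and a fixed softmax-confidence threshold).
import torch
import torch.nn as nn


class EarlyExitStack(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_layers: int, threshold: float = 0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.exit_head = nn.Linear(d_model, vocab_size)  # shared exit classifier
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            logits = self.exit_head(x[:, -1])  # next-token logits from this depth
            conf = torch.softmax(logits, dim=-1).max(dim=-1).values
            if bool((conf >= self.threshold).all()):  # confident: skip remaining layers
                return logits, i + 1
        return logits, len(self.layers)


# Usage with toy hidden states; `used` reports how many layers actually ran.
tokens = torch.randn(1, 8, 64)
logits, used = EarlyExitStack(64, 1000, n_layers=6)(tokens)
print(f"exited after {used} layers")
```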