Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention
- URL: http://arxiv.org/abs/2411.02063v1
- Date: Mon, 04 Nov 2024 13:06:17 GMT
- Title: Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention
- Authors: Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou,
- Abstract summary: We find that low-rank pre-training, normally considered as efficient methods, can be scalably effective when reduced parameters are precisely targeted.
We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis.
Our results show that LPA model can save up to 12.4% in time while achieving an approximate 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.
- Score: 27.46314600638108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving the effectiveness and efficiency of large language models (LLMs) simultaneously is a critical yet challenging research goal. In this paper, we find that low-rank pre-training, normally considered as efficient methods that will compromise performance, can be scalably effective when reduced parameters are precisely targeted. Specifically, applying the low-dimensional module only to the attention layer -- resolves this issue and enhances both effectiveness and efficiency. We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis. Through extensive experimentation at parameter scales of 130M, 370M, and scaling up to 3B, we have validated the effectiveness and scalability of LPA. Our results show that LPA model can save up to 12.4% in time while achieving an approximate 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.
Related papers
- Optimizing Active Learning in Vision-Language Models via Parameter-Efficient Uncertainty Calibration [6.7181844004432385]
We introduce a novel parameter-efficient learning methodology that incorporates uncertainty calibration loss within the Active Learning framework.<n>We demonstrate that our solution can match and exceed the performance of complex feature-based sampling techniques.
arXiv Detail & Related papers (2025-07-29T06:08:28Z) - On the Scaling of Robustness and Effectiveness in Dense Retrieval [111.58315434849047]
Robustness and effectiveness are critical aspects of developing dense retrieval models for real-world applications.<n>Recent work has addressed scaling laws of effectiveness in dense retrieval, revealing a power-law relationship between effectiveness and the size of models and data.<n>We find that robustness and effectiveness exhibit different scaling patterns, leading to significant resource costs when jointly improving both.
arXiv Detail & Related papers (2025-05-30T06:57:27Z) - EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z) - R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs.
Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z) - SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models [17.483183039447564]
This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead.
Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-03-10T17:59:03Z) - A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives.
We propose leave-one-out PPO (LOOP), a novel RL for diffusion fine-tuning method.
Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
arXiv Detail & Related papers (2025-03-02T13:43:53Z) - Leveraging the true depth of LLMs [46.81174316936993]
Large Language Models (LLMs) demonstrate remarkable capabilities at the cost of high compute requirements.<n>Recent studies have demonstrated that intermediate layers in LLMs can be removed or reordered without substantial accuracy loss.<n>We propose a novel method that groups consecutive layers into pairs evaluated in parallel.
arXiv Detail & Related papers (2025-02-05T00:26:27Z) - Fine Tuning without Catastrophic Forgetting via Selective Low Rank Adaptation [13.084333776247743]
Fine-tuning can reduce robustness to distribution shifts, impacting out-of-distribution (OOD) performance.
We propose a parameter-efficient fine-tuning (PEFT) method, using an indicator function to selectively activate Low-Rank Adaptation (LoRA) blocks.
We demonstrate that effective fine-tuning can be achieved with as few as 5% of active blocks, substantially improving efficiency.
arXiv Detail & Related papers (2025-01-26T03:22:22Z) - Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models [14.68920095399595]
sparsity-based PEFT (SPEFT) introduces trainable sparse adaptations to the weight matrices in the model.
We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies.
Our work challenges the notion that complexity is necessary for effective PEFT.
arXiv Detail & Related papers (2024-12-18T04:14:35Z) - Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement [0.7558576228782637]
We propose a framework for efficient Source-Free Domain Adaptation (SFDA)
Our approach introduces an improved paradigm for source-model preparation and target-side adaptation.
We demonstrate that our framework is compatible with various SFDA methods and achieves significant computational efficiency.
arXiv Detail & Related papers (2024-10-03T02:12:03Z) - EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models [29.57891007810509]
Large Language Models (LLMs) have demonstrated outstanding performance across a variety of natural language processing tasks.
We introduce EchoAtt, a novel framework aimed at optimizing transformer-based models by analyzing and leveraging the similarity of attention patterns across layers.
Our best results with TinyLLaMA-1.1B demonstrate that EchoAtt improves inference speed by 15%, training speed by 25%, and reduces the number of parameters by approximately 4%, all while improving zero-shot performance.
arXiv Detail & Related papers (2024-09-22T21:08:37Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Sine Activated Low-Rank Matrices for Parameter Efficient Learning [25.12262017296922]
We propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process.
Our method proves to be an enhancement for existing low-rank models, as evidenced by its successful application in Vision Transformers (ViT), Large Language Models (LLMs), Neural Radiance Fields (NeRF)
arXiv Detail & Related papers (2024-03-28T08:58:20Z) - L3 Ensembles: Lifelong Learning Approach for Ensemble of Foundational
Language Models [15.726224465017596]
We propose an approach that focuses on extracting meaningful representations from unseen data and constructing a structured knowledge base.
We conducted experiments on various NLP tasks to validate its effectiveness, including benchmarks like GLUE and SuperGLUE.
The proposed L3 ensemble method increases the model accuracy by 4% 36% compared to the fine-tuned FLM.
arXiv Detail & Related papers (2023-11-11T06:59:50Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z) - Federated Learning of Large Language Models with Parameter-Efficient
Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Dynamic Transformers Provide a False Sense of Efficiency [75.39702559746533]
Multi-exit models make a trade-off between efficiency and accuracy, where the saving of computation comes from an early exit.
We propose a simple yet effective attacking framework, SAME, which is specially tailored to reduce the efficiency of the multi-exit models.
Experiments on the GLUE benchmark show that SAME can effectively diminish the efficiency gain of various multi-exit models by 80% on average.
arXiv Detail & Related papers (2023-05-20T16:41:48Z) - SASL: Saliency-Adaptive Sparsity Learning for Neural Network
Acceleration [20.92912642901645]
We propose a Saliency-Adaptive Sparsity Learning (SASL) approach for further optimization.
Our method can reduce 49.7% FLOPs of ResNet-50 with very negligible 0.39% top-1 and 0.05% top-5 accuracy degradation.
arXiv Detail & Related papers (2020-03-12T16:49:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.