Advantageous Parameter Expansion Training Makes Better Large Language Models
- URL: http://arxiv.org/abs/2505.24241v1
- Date: Fri, 30 May 2025 06:06:23 GMT
- Title: Advantageous Parameter Expansion Training Makes Better Large Language Models
- Authors: Naibin Gu, Yilong Chen, Zhenyu Zhang, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
- Abstract summary: A subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. We propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones. APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.
- Score: 50.82647159657912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although scaling up the number of trainable parameters in both pre-training and fine-tuning can effectively improve the performance of large language models, it also leads to increased computational overhead. When delving into the parameter difference, we find that a subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. Further analysis reveals that stronger models tend to possess more such parameters. In this paper, we propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones, thereby increasing their proportion and enhancing training effectiveness. Further theoretical analysis from the perspective of matrix effective rank explains the performance gains of APEX. Extensive experiments on both instruction tuning and continued pre-training demonstrate that, in instruction tuning, APEX outperforms full-parameter tuning while using only 52% of the trainable parameters. In continued pre-training, APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.
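The abstract only sketches the mechanism, so the following is a loose, illustrative reading rather than the authors' algorithm: score the output units of a linear layer on calibration data, then blend the highest-scoring ("advantageous") rows into the lowest-scoring ones. The function `expand_advantageous_rows`, the activation-magnitude score, and the `expand_ratio`/`blend` knobs are all assumptions for illustration; APEX's actual advantage criterion, expansion operator, and progressive schedule are defined in the paper.

```python
# Hedged sketch only: NOT the authors' implementation of APEX.
import torch
import torch.nn as nn

@torch.no_grad()
def expand_advantageous_rows(linear: nn.Linear, calib_inputs: torch.Tensor,
                             expand_ratio: float = 0.25, blend: float = 0.5):
    """Blend the highest-scoring output rows of a linear layer into the
    lowest-scoring ones, raising the share of 'advantageous' directions."""
    acts = calib_inputs @ linear.weight.T + linear.bias      # (N, out_features)
    score = acts.abs().mean(dim=0)                           # per-output-unit score
    k = max(1, int(expand_ratio * linear.out_features))
    top = score.topk(k).indices                              # advantageous units
    bottom = score.topk(k, largest=False).indices            # disadvantageous units
    # Overwrite part of the low-scoring rows with the high-scoring directions.
    linear.weight[bottom] = (1 - blend) * linear.weight[bottom] + blend * linear.weight[top]
    linear.bias[bottom] = (1 - blend) * linear.bias[bottom] + blend * linear.bias[top]
    return top, bottom

# Toy usage on random calibration data.
layer = nn.Linear(16, 32)
calib = torch.randn(64, 16)
expand_advantageous_rows(layer, calib)
```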
Related papers
- Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization [31.379237532476875]
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks.
In this paper, we propose a multi-stage influence function to attribute predictions of fine-tuned LLMs to pre-training data.
arXiv Detail & Related papers (2025-05-08T07:43:44Z)
- STEP: Staged Parameter-Efficient Pre-training for Large Language Models [16.77087225406202]
Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model parameters.
We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth.
arXiv Detail & Related papers (2025-04-05T12:07:08Z)
- A Hessian-informed hyperparameter optimization for differential learning rate [10.43211367988483]
Hessian-informed differential learning rate (Hi-DLR) is a technique that applies different learning rates to different model parameters.
We show that Hi-DLR can improve convergence by dynamically determining the learning rates during training.
arXiv Detail & Related papers (2025-01-12T22:21:06Z)
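For the Hi-DLR entry above, a minimal sketch of the generic mechanism it builds on: PyTorch optimizer parameter groups with different step sizes. The layer sizes and learning rates here are placeholders, and the Hessian-informed grouping and rate selection that the paper actually contributes are only indicated in a comment.

```python
# Hedged illustration of differential learning rates via parameter groups.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

# Assign different learning rates to different parameter subsets.
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 1e-3},   # early layer: larger steps
    {"params": model[2].parameters(), "lr": 1e-4},   # head: smaller steps
])

x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# A Hessian-informed variant would periodically re-estimate curvature per group
# (e.g., via Hessian-vector products) and rescale each group's "lr" accordingly.
```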
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts.
Thanks to this design, ALoRE adds negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
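For the ALoRE entry above, a minimal sketch of a Kronecker-product parameterization of a weight update, the kind of hypercomplex structure the summary refers to. The sizes and names are illustrative, and the multi-expert aggregation and merging into the frozen backbone are not shown.

```python
# Hedged sketch: a large update matrix built from few trainable entries.
import torch

n, d = 8, 768                                   # illustrative sizes (d divisible by n)
S = torch.randn(n, n, requires_grad=True)       # small trainable factor
U = torch.randn(d // n, d // n, requires_grad=True)
delta_W = torch.kron(S, U)                      # (d, d) update from n*n + (d/n)^2 params
print(delta_W.shape)                            # torch.Size([768, 768])
```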
- Pre-training Everywhere: Parameter-Efficient Fine-Tuning for Medical Image Analysis via Target Parameter Pre-training [17.433808197776003]
We propose a simple yet effective fine-tuning framework based on Target Parameter Pre-training (TPP).
TPP includes an additional stage before PEFT to pre-train the target parameters.
TPP can be easily integrated into existing PEFT methods, significantly improving performance.
arXiv Detail & Related papers (2024-08-27T12:48:46Z)
- Forecast-PEFT: Parameter-Efficient Fine-Tuning for Pre-trained Motion Forecasting Models [68.23649978697027]
Forecast-PEFT is a fine-tuning strategy that freezes the majority of the model's parameters, focusing adjustments on newly introduced prompts and adapters.
Our experiments show that Forecast-PEFT outperforms traditional full fine-tuning methods in motion prediction tasks.
Forecast-FT further improves prediction performance, showing up to a 9.6% improvement over conventional baseline methods.
arXiv Detail & Related papers (2024-07-28T19:18:59Z)
- SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information [3.6859322366469933]
Methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace.
In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA with alternative parameters.
arXiv Detail & Related papers (2024-06-03T05:40:34Z)
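For the SwitchLoRA entry above, a minimal sketch of the plain LoRA layer it starts from: a frozen base weight plus a trainable low-rank product scaled by alpha/r. The `LoRALinear` class and its sizes are illustrative, and the switching of candidate low-rank vectors that SwitchLoRA adds is not reproduced here.

```python
# Hedged sketch of a standard LoRA linear layer (not SwitchLoRA itself).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                     # frozen pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus low-rank update B A x, scaled.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(32, 64)
print(layer(torch.randn(4, 32)).shape)                  # torch.Size([4, 64])
```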
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference [63.52244442498831]
Fine-tuning and inference with large language models (LMs) are generally known to be expensive.
We introduce APT, which adaptively prunes and tunes parameters for LMs.
We show that APT speeds up LM fine-tuning by up to 8x and reduces the training memory footprint of large LMs by up to 70%.
arXiv Detail & Related papers (2024-01-22T18:39:40Z)
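For the APT entry above, a hedged illustration of the pruning side of "pruning and tuning": zero out the lowest-magnitude output rows of a layer while the rest continue to train. The function `prune_rows_by_magnitude` and the simple magnitude score are assumptions; APT's adaptive salience estimation and tuning-time schedule are not reproduced.

```python
# Hedged sketch of magnitude-based row pruning (not APT's actual criterion).
import torch
import torch.nn as nn

@torch.no_grad()
def prune_rows_by_magnitude(linear: nn.Linear, keep_ratio: float = 0.7) -> torch.Tensor:
    scores = linear.weight.abs().mean(dim=1)               # per-output-row magnitude
    k = int(keep_ratio * linear.out_features)
    keep = torch.zeros(linear.out_features, dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    linear.weight[~keep] = 0.0                              # prune low-salience rows
    if linear.bias is not None:
        linear.bias[~keep] = 0.0
    return keep                                             # mask reusable during tuning

layer = nn.Linear(64, 128)
mask = prune_rows_by_magnitude(layer, keep_ratio=0.5)
print(int(mask.sum()), "of", layer.out_features, "rows kept")
```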
- Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method termed SSF, in which researchers only need to Scale and Shift the deep Features extracted by a pre-trained model to match the performance of full finetuning.
arXiv Detail & Related papers (2022-10-17T08:14:49Z)
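For the SSF entry above, a minimal sketch of the scale-and-shift modulation on frozen features: learn only a per-channel scale (gamma) and shift (beta). The `ScaleShift` module and tensor shapes are illustrative; where SSF inserts these parameters in the backbone and how they are re-parameterized away at inference follow the paper and are not shown.

```python
# Hedged sketch of per-channel scale-and-shift on frozen features.
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(dim))    # shift, initialized to 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma + self.beta             # broadcast over the last dim

feats = torch.randn(8, 197, 768)                      # e.g., frozen ViT token features
print(ScaleShift(768)(feats).shape)                   # torch.Size([8, 197, 768])
```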
- Know Where You're Going: Meta-Learning for Parameter-Efficient Fine-tuning [34.66092282348687]
We show that taking the ultimate choice of fine-tuning method into consideration boosts the performance of parameter-efficient fine-tuning.
We prime the pretrained model specifically for parameter-efficient fine-tuning, resulting in gains of up to 1.7 points on cross-lingual NER fine-tuning.
arXiv Detail & Related papers (2022-05-25T02:51:57Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
arXiv Detail & Related papers (2020-10-24T07:43:00Z)
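For the embedding-coupling entry above, a minimal sketch contrasting tied input/output embeddings with decoupled ones. The vocabulary size and width are toy values, and the parameter-allocation strategies studied in the paper are not shown.

```python
# Hedged sketch: tied vs. decoupled embeddings in a toy LM head.
import torch
import torch.nn as nn

vocab, d_model = 1000, 64

# Tied: one matrix serves as both the input embedding and the output projection.
tied_emb = nn.Embedding(vocab, d_model)
def tied_logits(h: torch.Tensor) -> torch.Tensor:    # h: (batch, d_model)
    return h @ tied_emb.weight.T                      # reuse the embedding weights

# Decoupled: separate matrices, allowing independent sizing and allocation.
in_emb = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab, bias=False)

h = torch.randn(4, d_model)
print(tied_logits(h).shape, out_proj(h).shape)        # both torch.Size([4, 1000])
```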