The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
- URL: http://arxiv.org/abs/2502.19002v2
- Date: Fri, 13 Jun 2025 07:42:25 GMT
- Title: The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
- Authors: Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, Lei Wu
- Abstract summary: We propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 2B. We incorporate Blockwise LR into Adam-mini, a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving.
- Score: 51.84624027213658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 2B and datasets of OpenWebText, MiniPile, and C4. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
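To make the strategy concrete, the sketch below shows one way a blockwise LR could be wired into AdamW via PyTorch parameter groups. This is a minimal illustration, not the paper's implementation: the name-matching heuristics assume conventional Transformer parameter naming, and the per-block multipliers are placeholders to be set from each block's measured sharpness rather than the paper's tuned ratios.

```python
# Minimal sketch of a blockwise-LR setup via AdamW parameter groups
# (illustrative only; assumes common Transformer parameter names).
import torch
from torch import nn

def blockwise_param_groups(model: nn.Module, base_lr: float):
    """Partition parameters into blocks and attach a per-block LR."""
    # Placeholder multipliers: in the spirit of the sharpness disparity,
    # flatter (less sharp) blocks can take a larger LR than sharper ones;
    # set these from measured per-block sharpness instead of the 1.0 defaults.
    lr_scale = {"embed": 1.0, "attn": 1.0, "ffn": 1.0, "norm": 1.0, "head": 1.0}
    groups = {name: [] for name in lr_scale}
    for pname, param in model.named_parameters():
        if "embed" in pname:
            groups["embed"].append(param)
        elif "attn" in pname or "attention" in pname:
            groups["attn"].append(param)
        elif "mlp" in pname or "ffn" in pname:
            groups["ffn"].append(param)
        elif "norm" in pname or "ln_" in pname:
            groups["norm"].append(param)
        else:  # e.g., the LM head / output projection
            groups["head"].append(param)
    return [
        {"params": ps, "lr": base_lr * lr_scale[name]}
        for name, ps in groups.items() if ps
    ]

# Usage: each block then trains under its own LR inside vanilla AdamW.
# optimizer = torch.optim.AdamW(blockwise_param_groups(model, base_lr=6e-4),
#                               betas=(0.9, 0.95), weight_decay=0.1)
```

Because the scheme only touches parameter-group construction, it should compose with any Adam-style optimizer that accepts groups, which is presumably also how it pairs with Adam-mini.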
Related papers
- Scaling Embedding Layers in Language Models [52.47659840377581]
SCONE is a new method for extending input embedding layers to enhance language model performance. These $n$-gram embeddings provide a contextualized representation for each input token and are learned with a separate model during training. SCONE enables two new scaling strategies: increasing the number of $n$-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference.
arXiv Detail & Related papers (2025-02-03T18:59:32Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production.
Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance.
We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing a minimal set of late pre-trained layers alleviates the peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - FoldGPT: Simple and Effective Large Language Model Compression Scheme [5.611544096046119]
Network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices.
We propose FoldGPT, which combines block removal and block parameter sharing.
Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression.
arXiv Detail & Related papers (2024-07-01T03:17:53Z) - Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent [15.463595798992621]
Large language models (LLMs) have revolutionized the deep learning paradigm, yielding impressive results across a wide array of tasks.
Existing solutions make the unrealistic assumption that the entire model is exchanged for training.
We introduce a novel method for the efficient training and fine-tuning of LLMs in FL, with minimal resource consumption.
arXiv Detail & Related papers (2024-06-17T03:49:44Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - BlockFUL: Enabling Unlearning in Blockchained Federated Learning [26.47424619448623]
Unlearning in Federated Learning (FL) presents significant challenges, as models grow and evolve with complex inheritance relationships.
In this paper, we introduce a novel framework with a dual-chain structure comprising a live chain and an archive chain for enabling unlearning capabilities within FL.
Two new unlearning paradigms, parallel and sequential, can be effectively implemented through gradient-ascent-based and re-training-based unlearning methods.
Our experiments validate that these methods effectively reduce data dependency and operational overhead, thereby boosting the overall performance of unlearning inherited models within BlockFUL.
arXiv Detail & Related papers (2024-02-26T04:31:53Z) - Salsa Fresca: Angular Embeddings and Pre-Training for ML Attacks on Learning With Errors [10.800552110718714]
Learning with Errors (LWE) is a hard math problem underlying post-quantum cryptography systems for key exchange and digital signatures.
Prior work proposed new machine learning (ML)-based attacks on LWE problems with small, sparse secrets, but these attacks require millions of LWE samples to train on and take days to recover secrets.
We propose three key methods -- better preprocessing, angular embeddings and model pre-training -- to improve these attacks.
arXiv Detail & Related papers (2024-02-02T00:48:27Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)