Frustratingly Simple Memory Efficiency for Pre-trained Language Models
via Dynamic Embedding Pruning
- URL: http://arxiv.org/abs/2309.08708v1
- Date: Fri, 15 Sep 2023 19:00:00 GMT
- Title: Frustratingly Simple Memory Efficiency for Pre-trained Language Models
via Dynamic Embedding Pruning
- Authors: Miles Williams, Nikolaos Aletras
- Abstract summary: The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
- Score: 42.652021176354644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The extensive memory footprint of pre-trained language models (PLMs) can
hinder deployment in memory-constrained settings, such as cloud environments or
on-device. PLMs use embedding matrices to represent extensive vocabularies,
forming a large proportion of the model parameters. While previous work towards
parameter-efficient PLM development has considered pruning parameters within
the transformer layers, pruning the embedding matrix as part of fine-tuning or
inference has yet to be explored. We first demonstrate that a significant
proportion of the vocabulary remains unused in these scenarios. We then propose
a simple yet effective approach that leverages this finding to minimize the
memory footprint of the embedding matrix. We show that this approach provides
substantial reductions in memory usage across a wide range of models and tasks.
Notably, our approach maintains equivalent downstream task performance while
allowing a more efficient use of compute resources.
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z) - SVFit: Parameter-Efficient Fine-Tuning of Large Pre-Trained Models Using Singular Values [12.137869917556415]
Large pre-trained models (LPMs) have demonstrated exceptional performance in diverse natural language processing and computer vision tasks.
fully fine-tuning these models poses substantial memory challenges, particularly in resource-constrained environments.
We propose SVFit, a novel PEFT approach that leverages singular value decomposition (SVD) to low-rank matrices using critical singular values as trainable parameters.
arXiv Detail & Related papers (2024-09-09T08:44:53Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information [5.756323337411276]
Large Language Models (LLMs) have advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis.
Their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment.
We propose Athena, a novel algorithm for efficient block-wise post-training quantization of LLMs.
arXiv Detail & Related papers (2024-05-24T03:14:29Z) - Make Pre-trained Model Reversible: From Parameter to Memory Efficient
Fine-Tuning [6.451743797015637]
We propose memory-efficient fine-tuning (MEFT) for pre-trained language models.
MEFT inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training.
MEFT significantly reduces the activation memory up to 84% of full fine-tuning with a negligible amount of trainable parameters.
arXiv Detail & Related papers (2023-06-01T09:26:17Z) - Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness.
Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context.
Our reference implementation achieves up to $2times$ increase in inference throughput and even greater memory savings.
arXiv Detail & Related papers (2023-05-25T07:39:41Z) - Numerical Optimizations for Weighted Low-rank Estimation on Language
Model [73.12941276331316]
Singular value decomposition (SVD) is one of the most popular compression methods that approximates a target matrix with smaller matrices.
Standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption.
We show that our method can perform better than current SOTA methods in neural-based language models.
arXiv Detail & Related papers (2022-11-02T00:58:02Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a.
sparse-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - Direction is what you need: Improving Word Embedding Compression in
Large Language Models [7.736463504706344]
This paper presents a novel loss objective to compress token embeddings in Transformer-based models by leveraging an AutoEncoder architecture.
Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity.
arXiv Detail & Related papers (2021-06-15T14:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.