Large Product Key Memory for Pretrained Language Models
- URL: http://arxiv.org/abs/2010.03881v1
- Date: Thu, 8 Oct 2020 10:19:50 GMT
- Title: Large Product Key Memory for Pretrained Language Models
- Authors: Gyuwan Kim and Tae-Hwan Jung
- Abstract summary: Product key memory (PKM) improves prediction accuracy by increasing model capacity efficiently, with insignificant computational overhead.
Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be finetuned for a wide variety of downstream NLP tasks.
- Score: 12.932177565788974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Product key memory (PKM), proposed by Lample et al. (2019), improves
prediction accuracy by increasing model capacity efficiently with insignificant
computational overhead. However, its empirical application has so far been limited
to causal language modeling. Motivated by the recent success of pretrained
language models (PLMs), we investigate how to incorporate large PKM into PLMs
that can be finetuned for a wide variety of downstream NLP tasks. We define a
new memory usage metric, and careful observation using this metric reveals that
most memory slots remain outdated during the training of PKM-augmented models.
To train better PLMs by tackling this issue, we propose simple but effective
solutions: (1) initialization from the model weights pretrained without memory
and (2) augmenting PKM by addition rather than replacing a feed-forward
network. We verify that both of them are crucial for the pretraining of
PKM-augmented PLMs, enhancing memory utilization and downstream performance.
Code and pretrained weights are available at
https://github.com/clovaai/pkm-transformers.
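For concreteness, here is a minimal PyTorch sketch of a product-key memory layer in the spirit of Lample et al. (2019): the query is split in two, each half is scored against a small table of sub-keys, and the Cartesian product of the two top-k index lists addresses a large table of value vectors. All hyperparameters below are illustrative assumptions, not the paper's configuration.

```python
# Minimal product-key memory (PKM) sketch; hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    def __init__(self, dim=512, n_keys=512, topk=32):
        super().__init__()
        self.n_keys, self.topk = n_keys, topk
        half = dim // 2
        # Two sub-key tables; their Cartesian product addresses n_keys**2 slots.
        self.keys1 = nn.Parameter(torch.randn(n_keys, half) / half ** 0.5)
        self.keys2 = nn.Parameter(torch.randn(n_keys, half) / half ** 0.5)
        self.query = nn.Linear(dim, dim)
        self.values = nn.EmbeddingBag(n_keys ** 2, dim, mode="sum")

    def forward(self, x):                                   # x: (batch, dim)
        q1, q2 = self.query(x).chunk(2, dim=-1)
        s1, i1 = (q1 @ self.keys1.t()).topk(self.topk, -1)  # (batch, k)
        s2, i2 = (q2 @ self.keys2.t()).topk(self.topk, -1)
        # Combine the two top-k lists into k*k candidate slots, then re-rank.
        scores = (s1[:, :, None] + s2[:, None, :]).flatten(1)
        slots = (i1[:, :, None] * self.n_keys + i2[:, None, :]).flatten(1)
        scores, best = scores.topk(self.topk, -1)
        slots = slots.gather(1, best)
        w = F.softmax(scores, dim=-1)
        # Sparse weighted sum over the selected value vectors.
        return self.values(slots, per_sample_weights=w)     # (batch, dim)
```

In these terms, the paper's fix (2) corresponds to adding the memory output alongside the existing feed-forward block, e.g. h + ffn(h) + pkm(h), rather than substituting the memory for the FFN, and fix (1) initializes the surrounding transformer weights from a model pretrained without memory. The usage metric then tracks how many of the n_keys**2 slots are actually selected and updated during training.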
Related papers
- EMP: Enhance Memory in Data Pruning [18.535687216213628]
Recently, large language and vision models have shown strong performance, but due to high pre-training and fine-tuning costs, research has shifted towards faster training via dataset pruning.
Previous methods used sample loss as an evaluation criterion, aiming to select the most "difficult" samples for training.
We propose Enhance Memory Pruning (EMP), which addresses the issue of insufficient memory under high pruning rates by enhancing the model's memory of data, thereby improving its performance.
arXiv Detail & Related papers (2024-08-28T10:29:52Z)
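A minimal sketch of the loss-based selection criterion mentioned in the summary above (the prior approach, not EMP's memory-enhancing method itself); the loader yielding (dataset_index, (inputs, targets)) pairs is an assumption:

```python
# Score each sample by its loss and keep the hardest fraction for training.
import torch

@torch.no_grad()
def prune_by_loss(model, loader, keep_ratio=0.3):
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    losses, indices = [], []
    for idx, (x, y) in loader:
        losses.append(loss_fn(model(x), y))    # per-sample loss
        indices.append(idx)
    losses, indices = torch.cat(losses), torch.cat(indices)
    k = int(keep_ratio * len(losses))
    hardest = losses.topk(k).indices           # most "difficult" samples
    return indices[hardest]                    # dataset indices to keep
```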
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric [57.3330687266266]
We find that using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance.
Using the Module-wise Pruning Error (MoPE) metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages.
arXiv Detail & Related papers (2024-03-12T17:24:26Z)
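A hedged sketch of what a module-wise pruning-error measurement could look like: score each candidate module by the drop in a downstream metric caused by ablating it, then prune the lowest-error modules first. The `evaluate` callback and the zeroing-based ablation are illustrative assumptions, not MoPE-CLIP's exact procedure.

```python
# Rank modules by the performance drop their removal causes.
def module_wise_pruning_error(model, modules, evaluate):
    base = evaluate(model)                    # e.g., zero-shot retrieval score
    errors = {}
    for name, module in modules.items():      # modules with a .weight tensor
        saved = module.weight.data.clone()
        module.weight.data.zero_()            # ablate the module in place
        errors[name] = base - evaluate(model)
        module.weight.data.copy_(saved)       # restore original weights
    # Lower error -> safer to prune first.
    return sorted(errors, key=errors.get)
```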
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning [42.652021176354644]
The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective approach that leverages the observation that much of the vocabulary goes unused in a given downstream task to minimize the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
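A minimal sketch of the idea above: since a downstream corpus typically touches only part of the vocabulary, the embedding matrix can be shrunk to the observed token ids. The helper below is an illustrative assumption, not the paper's implementation.

```python
# Keep only embedding rows for token ids seen in the task corpus, and remap ids.
import torch
import torch.nn as nn

def prune_embeddings(embedding: nn.Embedding, token_ids_in_corpus: set):
    keep = torch.tensor(sorted(token_ids_in_corpus))
    pruned = nn.Embedding(len(keep), embedding.embedding_dim)
    pruned.weight.data = embedding.weight.data[keep].clone()
    # old id -> new id lookup for remapping model inputs
    remap = {old.item(): new for new, old in enumerate(keep)}
    return pruned, remap
```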
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable across different model families ranging from 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
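MiniLLM's key ingredient is a reverse KL objective, KL(student || teacher), which discourages the student from placing mass where the teacher does not. The paper optimizes this with a policy-style procedure; the direct token-level form below is an illustrative simplification.

```python
# Reverse KL distillation loss over the vocabulary dimension.
import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits):
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = E_{p_s}[log p_s - log p_t], summed over the vocabulary
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()
```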
- Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning [6.451743797015637]
We propose memory-efficient fine-tuning (MEFT) for pre-trained language models.
MEFT inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training.
MEFT reduces activation memory by up to 84% relative to full fine-tuning, with a negligible number of trainable parameters.
arXiv Detail & Related papers (2023-06-01T09:26:17Z)
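The reversibility MEFT relies on can be illustrated with a generic additive coupling: because the block is exactly invertible, intermediate activations can be recomputed in the backward pass instead of cached. This is a standard reversible-network sketch, not MEFT's exact construction.

```python
# Invertible coupling of a PLM sublayer f and a small adapter g.
class ReversibleBlock:
    def __init__(self, f, g):    # f: frozen PLM sublayer, g: adapter
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):   # recover inputs exactly, no activation cache
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```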
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
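For background, plain column-row sampling (CRS) estimates a matrix product without computing all outer products: sample k column/row pairs with norm-proportional probabilities and rescale, which keeps the estimator unbiased. WTA-CRS is described as a lower-variance member of this family; its winner-take-all refinement is omitted in this sketch.

```python
# Unbiased column-row sampling estimate of A @ B.
import torch

def crs_matmul(A, B, k):
    # A: (m, n), B: (n, p); sample k of the n column/row outer products.
    probs = A.norm(dim=0) * B.norm(dim=1)
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, k, replacement=True)
    scale = 1.0 / (k * probs[idx])             # importance-sampling weights
    return (A[:, idx] * scale) @ B[idx, :]     # unbiased estimate of A @ B
```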
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely the "Model generated dEnoising TRaining Objective" (METRO).
The resulting models, METRO-LM, with up to 5.4 billion parameters, achieve new state-of-the-art results on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
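A hedged sketch of training with model-generated signals: a small auxiliary generator fills masked positions, and the main model trains on the resulting corrupted sequence (as in ELECTRA-style corrective objectives). The names and sampling scheme are illustrative assumptions; METRO's exact objectives differ.

```python
# Build a corrupted batch from an auxiliary generator's predictions.
import torch

def make_denoising_batch(tokens, generator, mask_token_id, mask_prob=0.15):
    mask = torch.rand(tokens.shape) < mask_prob
    masked = tokens.masked_fill(mask, mask_token_id)
    with torch.no_grad():
        logits = generator(masked)                    # (batch, seq, vocab)
        sampled = torch.distributions.Categorical(logits=logits).sample()
    inputs = torch.where(mask, sampled, tokens)       # generator fills masks
    labels = mask                                     # replaced-position targets
    return inputs, labels
```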
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
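The parameter-reuse idea can be illustrated with a function-preserving width expansion in the Net2Net style: new hidden units duplicate existing ones, and outgoing weights are rescaled so the widened layer computes the same function at initialization. bert2BERT's full recipe is more involved; this sketch is an assumption for illustration.

```python
# Function-preserving expansion of a hidden layer from `hidden` to `new_width`.
import torch

def expand_hidden(w_in, w_out, new_width):
    # w_in: (hidden, d_in), w_out: (d_out, hidden)
    hidden = w_in.shape[0]
    src = torch.randint(0, hidden, (new_width - hidden,))
    w_in_new = torch.cat([w_in, w_in[src]], dim=0)        # duplicate units
    counts = torch.ones(hidden)                           # multiplicity per unit
    counts.index_add_(0, src, torch.ones(len(src)))
    # Divide outgoing weights by multiplicity so the output is unchanged.
    w_out_new = torch.cat([w_out / counts, (w_out / counts)[:, src]], dim=1)
    return w_in_new, w_out_new
```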
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for using PLMs that address the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
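The knowledge-inheritance idea resembles distillation during pretraining: the new model mixes its self-supervised loss with a term matching an existing PLM's predictions, rather than learning from scratch. The loss below, including the mixing weight alpha, is an illustrative assumption.

```python
# Mix a language-modeling loss with a distillation term from an existing PLM.
import torch.nn.functional as F

def inheritance_loss(student_logits, teacher_logits, labels, alpha=0.5):
    lm = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return (1 - alpha) * lm + alpha * kd
```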
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.