Masking as an Efficient Alternative to Finetuning for Pretrained
Language Models
- URL: http://arxiv.org/abs/2004.12406v2
- Date: Sun, 11 Oct 2020 11:52:08 GMT
- Title: Masking as an Efficient Alternative to Finetuning for Pretrained
Language Models
- Authors: Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze
- Abstract summary: We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
- Score: 49.64561153284428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an efficient method of utilizing pretrained language models, where
we learn selective binary masks for pretrained weights in lieu of modifying
them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a
series of NLP tasks show that our masking scheme yields performance comparable
to finetuning, yet has a much smaller memory footprint when several tasks need
to be inferred simultaneously. Through intrinsic evaluations, we show that
representations computed by masked language models encode information necessary
for solving downstream tasks. Analyzing the loss landscape, we show that
masking and finetuning produce models that reside in minima that can be
connected by a line segment with nearly constant test accuracy. This confirms
that masking can be utilized as an efficient alternative to finetuning.
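The core idea in the abstract can be illustrated with a short sketch (not the authors' code): each pretrained weight matrix is frozen, a real-valued score is trained per weight, and the scores are thresholded into a binary mask using a straight-through estimator so gradients still reach them. The layer name, threshold, and initialization below are illustrative assumptions; in the paper's setting the mask would be applied to BERT/RoBERTa weight matrices rather than a toy linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Linear layer with frozen pretrained weights; only a binary mask is trained."""

    def __init__(self, pretrained: nn.Linear, threshold: float = 0.5):
        super().__init__()
        # Copy and freeze the pretrained parameters.
        self.weight = nn.Parameter(pretrained.weight.detach().clone(), requires_grad=False)
        self.bias = (
            nn.Parameter(pretrained.bias.detach().clone(), requires_grad=False)
            if pretrained.bias is not None
            else None
        )
        # Real-valued scores; the binary mask is obtained by thresholding them.
        # Initializing slightly above the threshold makes the initial mask all-ones,
        # i.e. training starts from the unmodified pretrained layer (an assumption).
        self.scores = nn.Parameter(torch.full_like(self.weight, threshold + 0.01))
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hard_mask = (self.scores > self.threshold).float()
        # Straight-through estimator: the forward pass uses the hard binary mask,
        # while the backward pass routes gradients to the underlying scores.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)


# Hypothetical usage: mask one pretrained matrix and train only the mask scores.
pretrained = nn.Linear(768, 768)            # stand-in for a single BERT weight matrix
layer = MaskedLinear(pretrained)
optimizer = torch.optim.Adam([layer.scores], lr=1e-3)

x = torch.randn(4, 768)
loss = layer(x).pow(2).mean()               # placeholder for a downstream task loss
loss.backward()
optimizer.step()
```

Because only one bit per masked weight has to be stored per task, many tasks can be served from a single shared copy of the pretrained weights, which is the memory-footprint argument made in the abstract.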
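The loss-landscape claim, that the masked and finetuned solutions lie in minima connected by a line segment of nearly constant test accuracy, can be probed with a simple interpolation sweep. The helper below is a hypothetical sketch under PyTorch-style assumptions: `eval_fn` is an assumed callable returning test accuracy, and the masked model is assumed to have its mask already folded into the weights so both models share one parameterization.

```python
import copy
import torch


@torch.no_grad()
def linear_path_accuracy(model_a, model_b, eval_fn, steps=11):
    """Evaluate eval_fn along the straight line between two sets of weights.

    model_a, model_b: modules with identical architectures, e.g. the masked model
    (mask multiplied into its weights) and the finetuned model.
    eval_fn: assumed callable that takes a model and returns test accuracy.
    """
    state_a = {k: v.clone() for k, v in model_a.state_dict().items()}
    state_b = model_b.state_dict()
    probe = copy.deepcopy(model_a)
    results = []
    for step in range(steps):
        alpha = step / (steps - 1)
        # Interpolate floating-point tensors only; leave integer buffers untouched.
        blended = {
            k: ((1 - alpha) * state_a[k] + alpha * state_b[k])
            if torch.is_floating_point(state_a[k]) else state_a[k]
            for k in state_a
        }
        probe.load_state_dict(blended)
        results.append((alpha, eval_fn(probe)))
    return results
```

Plotting accuracy against the interpolation coefficient and seeing no pronounced dip along the path is the kind of evidence the abstract refers to.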
Related papers
- Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text [27.320746607958142]
Masked language modeling has become a widely adopted unsupervised technique to pre-train language models.
We propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme.
arXiv Detail & Related papers (2025-02-18T15:36:16Z)
- Instruction-Following Pruning for Large Language Models [58.329978053711024]
We move beyond the traditional static pruning approach of determining a fixed pruning mask for a model.
In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction.
Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task.
arXiv Detail & Related papers (2025-01-03T20:19:14Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Meta Mask Correction for Nuclei Segmentation in Histopathological Image [5.36728433027615]
We propose a novel meta-learning-based nuclei segmentation method to leverage data with noisy masks.
Specifically, we design a fully convolutional meta-model that can correct noisy masks using a small amount of clean meta-data.
We show that our method achieves state-of-the-art results.
arXiv Detail & Related papers (2021-11-24T13:53:35Z)
- Train No Evil: Selective Masking for Task-Guided Pre-Training [97.03615486457065]
We propose a three-stage framework by adding a task-guided pre-training stage with selective masking between general pre-training and fine-tuning.
We show that our method can achieve comparable or even better performance with less than 50% of the cost.
arXiv Detail & Related papers (2020-04-21T03:14:22Z)
- Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)