Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training
- URL: http://arxiv.org/abs/2412.16148v2
- Date: Mon, 14 Apr 2025 12:55:19 GMT
- Title: Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training
- Authors: Mingliang Liang, Martha Larson
- Abstract summary: Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of approaches: truncation, random masking, block masking, and syntax masking. In this paper, we show that the best masking strategy changes over training epochs. We analyze existing text masking approaches, including syntax masking, which is currently the state of the art, and identify the word frequency distribution as important in determining their success. Experiments on a large range of data sets demonstrate that syntax masking is outperformed by other approaches, given sufficient epochs, and that our proposed frequency-based approach, called Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), has numerous advantages. The benefits are particularly evident as the number of input tokens decreases.
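The abstract does not spell out CLIPF's exact masking rule, but the core idea of keeping a fixed token budget while preferentially dropping high-frequency words can be sketched as follows. The inverse-frequency weighting (via an Efraimidis-Spirakis weighted sample) is an illustrative stand-in, not the paper's formula:

```python
import random

def frequency_mask(tokens, counts, keep_n, rng=None):
    """Keep `keep_n` tokens of a caption, preferentially dropping
    high-frequency words. The weighting below is an illustrative
    stand-in for CLIPF's actual rule, which the abstract does not give.
    """
    rng = rng or random.Random(0)
    if len(tokens) <= keep_n:
        return list(tokens)
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # key = u^(1/w) with weight w = 1/count, i.e. key = u^count, so
    # frequent words get tiny keys and are masked out first.
    keyed = [(rng.random() ** counts.get(t, 1), i, t)
             for i, t in enumerate(tokens)]
    kept = sorted(keyed, reverse=True)[:keep_n]
    # Return the surviving tokens in their original order.
    return [t for _, i, t in sorted(kept, key=lambda k: k[1])]
```

`counts` here is a hypothetical word-to-corpus-frequency map; in practice it would be computed once over the whole caption corpus.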
Related papers
- Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text [27.320746607958142]
Masked language modeling has become a widely adopted unsupervised technique to pre-train language models.
We propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme.
arXiv Detail & Related papers (2025-02-18T15:36:16Z)
- Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency [0.0]
We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method. WFPP prunes text-image pairs containing high-frequency words across the entire training dataset. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks.
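The idea of pruning pairs whose captions are dominated by high-frequency words can be sketched with a simple mean-corpus-probability score. The scoring rule and the keep fraction are illustrative assumptions, not WFPP's exact criterion:

```python
from collections import Counter

def wfpp_prune(captions, keep_fraction=0.7):
    """Drop the image-text pairs whose captions contain the most
    high-frequency (least informative) words. Illustrative stand-in
    for WFPP's actual scoring rule.
    """
    counts = Counter(w for c in captions for w in c.split())
    total = sum(counts.values())

    def score(c):
        words = c.split()
        # Mean corpus probability of the caption's words:
        # higher = more generic caption.
        return sum(counts[w] / total for w in words) / max(len(words), 1)

    ranked = sorted(range(len(captions)), key=lambda i: score(captions[i]))
    keep = sorted(ranked[: int(len(captions) * keep_fraction)])
    return [captions[i] for i in keep]
```

Captions full of rare, specific words score low and survive; generic captions ("a photo of a ...") score high and are pruned first.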
arXiv Detail & Related papers (2024-10-09T11:54:41Z)
- Efficient Vision-Language Pre-training by Cluster Masking [13.845233914223561]
We propose a simple strategy for masking image patches during visual-language contrastive learning.
We randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities.
This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context.
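Masking clusters of visually similar patches, as described above, can be sketched with a crude intensity-bucket clustering. The bucketing scheme and cluster count are illustrative assumptions; the paper clusters on raw pixel intensities but does not use this exact procedure:

```python
import random

def cluster_mask(patch_means, n_clusters=3, rng=None):
    """Group image patches by mean pixel intensity (a crude similarity
    proxy) and mask every patch in one randomly chosen cluster.
    Illustrative sketch, not the paper's clustering method.
    """
    rng = rng or random.Random(0)
    lo, hi = min(patch_means), max(patch_means)
    width = (hi - lo) / n_clusters or 1.0
    # Assign each patch to an intensity bucket.
    labels = [min(int((m - lo) / width), n_clusters - 1)
              for m in patch_means]
    masked = rng.randrange(n_clusters)
    # Mask flag per patch: True = hidden from the image encoder.
    return [lab == masked for lab in labels]
```

Because a whole cluster disappears at once, the model cannot recover a masked region from its visually similar neighbours, which is what creates the extra learning signal.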
arXiv Detail & Related papers (2024-05-14T17:59:40Z)
- Self-Evolution Learning for Discriminative Language Model Pretraining [103.57103957631067]
Self-Evolution learning (SE) is a simple and effective token masking and learning method.
SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach.
arXiv Detail & Related papers (2023-05-24T16:00:54Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Scaling Language-Image Pre-training via Masking [63.36988191660858]
Fast Language-Image Pre-training (FLIP) is a simple and more efficient method for training CLIP.
Masking allows us to learn from more image-text pairs given the same wall-clock time.
FLIP dominantly outperforms CLIP counterparts trained on the same data.
arXiv Detail & Related papers (2022-12-01T18:59:57Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- Learning Better Masking for Better Language Model Pre-training [80.31112722910787]
Masked Language Modeling has been widely used as a denoising objective in pre-training language models (PrLMs).
PrLMs commonly adopt a Random-Token Masking strategy in which a fixed masking ratio is applied and different contents are masked with equal probability throughout the entire training.
We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages.
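A scheduled masking ratio of the kind described above can be sketched as a simple linear decay over training. The endpoints (30% down to 15%) are illustrative choices, not the paper's values:

```python
def masking_ratio(step, total_steps, start=0.30, end=0.15):
    """Linearly decay the masking ratio over training. A minimal
    sketch of a scheduled-masking idea; `start` and `end` are
    illustrative, not the paper's settings.
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * t
```

Early in training the model sees a harder (more heavily masked) task; the ratio then eases toward a conventional value as training progresses.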
arXiv Detail & Related papers (2022-08-23T08:27:52Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
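Aligning caption word semantics with mask visual features, as described above, amounts to a nearest-neighbour match in a shared embedding space. Everything here (feature shapes, the threshold, the cosine rule) is an illustrative assumption rather than the paper's implementation:

```python
def pseudo_label_masks(mask_feats, word_embs, threshold=0.5):
    """Assign each predicted object mask the caption word whose
    embedding is most similar to the mask's visual feature (cosine
    similarity). Illustrative sketch; names and threshold are
    hypothetical.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    labels = []
    for f in mask_feats:
        word, sim = max(((w, cos(f, e)) for w, e in word_embs.items()),
                        key=lambda p: p[1])
        # Keep only confident alignments as pseudo ground truth.
        labels.append(word if sim >= threshold else None)
    return labels
```

Masks that align with no caption word confidently enough get no pseudo-label and would be excluded from self-training.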
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Data Efficient Masked Language Modeling for Vision and Language [16.95631509102115]
Masked language modeling (MLM) is one of the key sub-tasks in vision-language training.
In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text.
We investigate a range of alternative masking strategies specific to the cross-modal setting that address these shortcomings.
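The random-masking baseline that this line of work improves upon is simple to state. The 15% default ratio is the common MLM convention, used here as an illustrative choice:

```python
import random

def random_mask(tokens, ratio=0.15, mask_token="[MASK]", rng=None):
    """Replace a random fraction of tokens with a mask token: the
    standard MLM corruption described above. The 15% ratio is the
    common default, not a value taken from these papers.
    """
    rng = rng or random.Random(0)
    n = max(1, int(len(tokens) * ratio))
    idx = set(rng.sample(range(len(tokens)), n))
    return [mask_token if i in idx else t for i, t in enumerate(tokens)]
```

In the cross-modal setting, the model then predicts the masked tokens given both the remaining text and the image.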
arXiv Detail & Related papers (2021-09-05T11:27:53Z)
- Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation [63.195935452646815]
We propose a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training.
We present a novel reinforcement learning-based framework which learns the masking policy.
We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets.
arXiv Detail & Related papers (2020-10-06T13:27:01Z)
- Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
arXiv Detail & Related papers (2020-04-26T15:03:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.