Accelerating Vision-Language Pretraining with Free Language Modeling
- URL: http://arxiv.org/abs/2303.14038v1
- Date: Fri, 24 Mar 2023 14:49:22 GMT
- Title: Accelerating Vision-Language Pretraining with Free Language Modeling
- Authors: Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie,
Ping Luo
- Abstract summary: Free language modeling (FLM) enables a 100% prediction rate with arbitrary corruption rates.
FLM successfully frees the prediction rate from the tie-up with the corruption rate.
Experiments show FLM could achieve an impressive 2.5x pretraining time reduction.
- Score: 62.30042851111692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The state of the art in vision-language pretraining (VLP) achieves exemplary
performance but suffers from high training costs resulting from slow
convergence and long training time, especially on large-scale web datasets. An
essential obstacle to training efficiency lies in the entangled prediction rate
(percentage of tokens for reconstruction) and corruption rate (percentage of
corrupted tokens) in masked language modeling (MLM), that is, a proper
corruption rate is achieved at the cost of a large portion of output tokens
being excluded from prediction loss. To accelerate the convergence of VLP, we
propose a new pretraining task, namely, free language modeling (FLM), that
enables a 100% prediction rate with arbitrary corruption rates. FLM
successfully frees the prediction rate from the tie-up with the corruption rate
while allowing the corruption spans to be customized for each token to be
predicted. FLM-trained models are encouraged to learn better and faster given
the same GPU time by exploiting bidirectional contexts more flexibly. Extensive
experiments show FLM could achieve an impressive 2.5x pretraining time
reduction in comparison to the MLM-based methods, while keeping competitive
performance on both vision-language understanding and generation tasks. Code
will be made public at https://github.com/TencentARC/FLM.
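The decoupling the abstract describes can be illustrated with a toy sketch (hypothetical helper names; not the authors' implementation). In MLM, the loss is computed only over the corrupted positions, so predicting more tokens forces corrupting more of the input; in FLM, every position contributes to the loss, each scored against its own corrupted context span, regardless of the corruption rate.

```python
import random

def mlm_loss_positions(seq_len, corruption_rate):
    """MLM: corrupt a subset of tokens; the loss is computed ONLY on the
    corrupted positions, so prediction rate == corruption rate."""
    corrupted = random.sample(range(seq_len), int(seq_len * corruption_rate))
    return sorted(corrupted)  # positions contributing to the loss

def flm_loss_positions(seq_len, corruption_rate):
    """FLM (sketch): every position gets its own corrupted context span, so
    ALL positions contribute to the loss (100% prediction rate) no matter
    what the corruption rate is. The span choice here is an illustrative
    assumption, not the paper's exact construction."""
    span = max(1, int(seq_len * corruption_rate))
    per_token_spans = {
        t: set(range(t, min(seq_len, t + span)))  # hypothetical span per token
        for t in range(seq_len)
    }
    return list(range(seq_len)), per_token_spans

random.seed(0)
mlm = mlm_loss_positions(32, 0.15)
flm, spans = flm_loss_positions(32, 0.15)
print(f"MLM predicts {len(mlm)}/32 tokens")  # tied to the corruption rate
print(f"FLM predicts {len(flm)}/32 tokens")  # always 100%
```

With a 15% corruption rate, MLM trains on roughly 5 of 32 output positions while the FLM sketch trains on all 32, which is the source of the claimed convergence speedup.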
Related papers
- ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs [1.1834200163382398]
ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning method for accelerating MLLM training.
It matches the peak accuracy of standard training on MVBench up to 2x faster, using only 35% of the tokens.
arXiv Detail & Related papers (2025-07-29T01:07:09Z)
- ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency.
We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection.
Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
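Online token-level selection can be sketched minimally; the criterion below (ranking by plain per-token loss) is an illustrative stand-in, not ESLM's actual risk-averse measure.

```python
def select_tokens(token_losses, keep_ratio):
    """Keep only the highest-loss tokens for the backward pass; the rest
    are dropped from this step's objective. Illustrative criterion only,
    not ESLM's exact risk measure."""
    k = max(1, int(len(token_losses) * keep_ratio))
    ranked = sorted(range(len(token_losses)),
                    key=lambda i: token_losses[i], reverse=True)
    return sorted(ranked[:k])

losses = [0.1, 2.3, 0.05, 1.7, 0.4, 3.1]
print(select_tokens(losses, 0.5))  # → [1, 3, 5], the three hardest tokens
```

Because easy tokens are skipped entirely, each optimizer step spends its FLOPs only on the tokens that still carry learning signal.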
arXiv Detail & Related papers (2025-05-26T12:23:26Z)
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
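A minimal sketch of the patch-forming step, assuming patches are built by averaging consecutive token embeddings (one simple choice; the paper's exact patch construction may differ):

```python
def to_patches(token_embeddings, patch_size):
    """Average every `patch_size` consecutive token embeddings into one
    patch embedding, shrinking the sequence by that factor. The model is
    first trained to predict the next patch on these shorter sequences,
    then switched back to ordinary token-level training."""
    patches = []
    for i in range(0, len(token_embeddings) - patch_size + 1, patch_size):
        group = token_embeddings[i:i + patch_size]
        dim = len(group[0])
        patches.append([sum(vec[d] for vec in group) / patch_size
                        for d in range(dim)])
    return patches

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(to_patches(tokens, 2))  # → [[2.0, 3.0], [6.0, 7.0]]
```

Halving the sequence length this way roughly halves the attention and FFN cost per training step during the patch-level stage.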
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose the MiLe Loss function for mitigating the bias of learning difficulties across tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
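The summary does not give MiLe's formula; purely as a generic illustration of difficulty-aware token weighting (NOT the MiLe definition, which is in the cited paper), one can up-weight harder tokens so that easy tokens dominate the objective less:

```python
def difficulty_weighted_loss(token_losses, gamma=1.0):
    """Reweight per-token losses so low-loss (easy) tokens contribute less.
    Illustrative stand-in only; MiLe's actual formulation differs and is
    defined in the cited paper."""
    weights = [loss ** gamma for loss in token_losses]
    total = sum(weights) or 1.0
    return sum(w * loss for w, loss in zip(weights, token_losses)) / total

# One hard token among easy ones pulls the weighted loss above the plain mean.
print(difficulty_weighted_loss([0.1, 0.1, 2.0]))
```

With `gamma=0` the weights collapse to uniform and the function reduces to the ordinary mean loss.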
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on the downstream performance of Large Language Models (LLMs) by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary size increase by a factor of three in comparison to English.
arXiv Detail & Related papers (2023-10-12T22:44:19Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
EVLGen is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with the removal of 30% of vision tokens across 12 ViT layers, ELIP maintains comparable performance.
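The prune-and-merge idea can be sketched as follows; the scoring signal here is a stand-in for ELIP's language-output supervision, and the merge rule (averaging the pruned tokens) is an illustrative assumption.

```python
def prune_and_merge(tokens, scores, keep_ratio=0.7):
    """Keep the top-scoring vision tokens and merge the pruned ones into a
    single averaged token, so their information is reduced rather than
    fully discarded. Sketch only; scores stand in for ELIP's
    language-output supervision."""
    k = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep, drop = sorted(order[:k]), order[k:]
    kept = [tokens[i] for i in keep]
    if drop:
        dim = len(tokens[0])
        merged = [sum(tokens[i][d] for i in drop) / len(drop)
                  for d in range(dim)]
        kept.append(merged)  # one extra token summarizing the pruned ones
    return kept

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]
print(prune_and_merge(tokens, scores, keep_ratio=0.5))
# → [[1.0, 0.0], [2.0, 2.0], [1.5, 0.5]]
```

Shrinking the vision token sequence early in the ViT stack is what reduces the cost of every subsequent layer.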
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel sequences.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
- Adversarial Training with Contrastive Learning in NLP [0.0]
We propose adversarial training with contrastive learning (ATCL) to adversarially train models for language processing tasks.
The core idea is to make linear perturbations in the embedding space of the input via fast gradient methods (FGM) and train the model to keep the original and perturbed representations close via contrastive learning.
The results show not only an improvement in the quantitative (perplexity and BLEU) scores compared to the baselines, but that ATCL also achieves good qualitative results at the semantic level for both tasks.
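The two ingredients described above can be sketched in a few lines; the pull-together term below is a simplified stand-in for the paper's contrastive objective, and the helper names are hypothetical.

```python
import math

def fgm_perturb(embedding, grad, epsilon=0.1):
    """FGM: a linear perturbation along the loss gradient, scaled to
    norm `epsilon` (the adversarial direction in embedding space)."""
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

def contrastive_pull(orig, pert):
    """Toy 'pull-together' term: squared distance between the original and
    perturbed representations. Minimizing it keeps the two close, a
    simplified stand-in for the contrastive loss used in the paper."""
    return sum((a - b) ** 2 for a, b in zip(orig, pert))

emb, grad = [1.0, 2.0], [3.0, 4.0]
adv = fgm_perturb(emb, grad, epsilon=0.5)
print(adv)                         # perturbed embedding
print(contrastive_pull(emb, adv))  # distance the training would minimize
```

A full contrastive objective would also push the pair apart from other examples in the batch (negatives), which this two-point sketch omits.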
arXiv Detail & Related papers (2021-09-19T07:23:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents (including all of the above) and is not responsible for any consequences of its use.