Multi-stage Pre-training over Simplified Multimodal Pre-training Models
- URL: http://arxiv.org/abs/2107.14596v1
- Date: Thu, 22 Jul 2021 03:35:27 GMT
- Title: Multi-stage Pre-training over Simplified Multimodal Pre-training Models
- Authors: Tongtong Liu and Fangxiang Feng and Xiaojie Wang
- Abstract summary: We propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities, from words and phrases to sentences, in both texts and images to pre-train the model in stages.
We also design several different pre-training tasks suited to the information granularity of each stage in order to efficiently capture diverse knowledge from a limited corpus.
Experimental results show that our method achieves performance comparable to the original LXMERT model on all downstream tasks, and even outperforms the original model on the Image-Text Retrieval task.
- Score: 35.644196343835674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal pre-training models, such as LXMERT, have achieved excellent
results in downstream tasks. However, current pre-trained models require large
amounts of training data and have huge model sizes, which make them difficult
to apply in low-resource situations. How to obtain similar or even better
performance than a larger model with less pre-training data and a smaller
model size has become an important problem. In this paper, we propose a new
Multi-stage Pre-training (MSP) method, which uses information at different
granularities, from words and phrases to sentences, in both texts and images to
pre-train the model in stages. We also design several different pre-training
tasks suited to the information granularity of each stage in order to
efficiently capture diverse knowledge from a limited corpus. We take a
Simplified LXMERT (LXMERT-S), which has only 45.9% of the parameters of the
original LXMERT model and uses 11.76% of the original pre-training data, as the
testbed of our MSP method. Experimental results show that our method achieves
performance comparable to the original LXMERT model on all downstream tasks,
and even outperforms the original model on the Image-Text Retrieval task.
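The staged schedule described in the abstract can be pictured as a training loop that switches both the data granularity and the pre-training objective between stages. The PyTorch sketch below is illustrative only: the toy encoder, the single matching loss, and the granularity-keyed loaders are hypothetical stand-ins, not the LXMERT-S architecture or the paper's actual stage-specific tasks.

```python
# Illustrative sketch of a multi-stage pre-training schedule (not the paper's code).
# Assumes three hypothetical data loaders, one per granularity, each yielding
# dicts with "text" (B, Lt, 300), "image" (B, Li, 2048), "match_label" (B,).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCrossModalEncoder(nn.Module):
    """Toy stand-in for a simplified cross-modal encoder."""
    def __init__(self, dim=128):
        super().__init__()
        self.text_proj = nn.Linear(300, dim)     # word/phrase/sentence embeddings
        self.image_proj = nn.Linear(2048, dim)   # image region features
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, text_feats, image_feats):
        tokens = torch.cat([self.text_proj(text_feats),
                            self.image_proj(image_feats)], dim=1)
        return self.fusion(tokens)

def matching_loss(encoder, batch):
    """Toy image-text matching objective reused for every stage in this sketch;
    the real method designs different tasks per granularity."""
    fused = encoder(batch["text"], batch["image"])
    score = fused.mean(dim=(1, 2))                      # pooled matching score
    return F.binary_cross_entropy_with_logits(score, batch["match_label"].float())

def pretrain_in_stages(encoder, loaders, epochs_per_stage=1, lr=1e-4):
    """Stage 1 trains on word-image pairs, stage 2 on phrase-image, stage 3 on sentence-image."""
    optimizer = torch.optim.Adam(encoder.parameters(), lr=lr)
    for granularity in ("word", "phrase", "sentence"):
        for _ in range(epochs_per_stage):
            for batch in loaders[granularity]:
                optimizer.zero_grad()
                matching_loss(encoder, batch).backward()
                optimizer.step()
```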
Related papers
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management [35.06717005729781]
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components.
Development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems.
We build a prototype system and evaluate it on various large MT MM models.
Experiments demonstrate the superior performance and efficiency of our system, with a speedup ratio of up to 71% compared to state-of-the-art training systems.
arXiv Detail & Related papers (2024-09-05T09:10:40Z) - POA: Pre-training Once for Models of All Sizes [33.72644336390202]
We propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All).
Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm.
It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones.
arXiv Detail & Related papers (2024-08-02T06:13:29Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning; a generic sketch of this recipe appears after the related papers list below.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn increasing attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss MM-PTMs with a focus on data, objectives, networks, and knowledge-enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z) - Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks, including question generation, summarization, and paraphrase generation, show that the proposed framework achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in the NLP field showed that transfer learning helps with achieving state-of-the-art results for new tasks by tuning pre-trained models instead of starting from scratch.
In this paper we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and that these simple models train in much less time than fine-tuning the pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z) - ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
Our model is Transformer-based; it takes different modalities as input and models the relationships between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)
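As referenced in the eP-ALM entry above, the parameter-efficient recipe it summarizes (freeze the pretrained language model, train one linear projection of visual features, and prepend one learnable token) can be sketched generically as follows. The class name, dimensions, and the assumption that the frozen LM accepts input embeddings directly are illustrative, not the paper's code.

```python
# Generic sketch of freezing a pretrained LM and training only a visual
# projection plus one soft token (hypothetical wiring, not eP-ALM's code).
import torch
import torch.nn as nn

class FrozenLMWithPerception(nn.Module):
    def __init__(self, language_model: nn.Module, vis_dim: int, emb_dim: int):
        super().__init__()
        self.lm = language_model
        for p in self.lm.parameters():            # freeze every LM parameter;
            p.requires_grad = False               # only the two modules below train
        self.vis_proj = nn.Linear(vis_dim, emb_dim)                  # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, emb_dim))   # one trainable token

    def forward(self, text_embeds, vis_feats):
        # text_embeds: (B, L, emb_dim) token embeddings from the frozen LM
        # vis_feats:   (B, vis_dim) pooled features from a frozen visual encoder
        vis_token = self.vis_proj(vis_feats).unsqueeze(1)             # (B, 1, emb_dim)
        soft = self.soft_token.expand(text_embeds.size(0), -1, -1)    # (B, 1, emb_dim)
        inputs = torch.cat([soft, vis_token, text_embeds], dim=1)
        return self.lm(inputs)    # assumes the LM consumes input embeddings directly

# Only the projection and the soft token reach the optimizer:
# model = FrozenLMWithPerception(my_frozen_lm, vis_dim=768, emb_dim=768)
# trainable = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```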