Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
- URL: http://arxiv.org/abs/2211.09807v2
- Date: Mon, 21 Nov 2022 17:46:53 GMT
- Title: Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
- Authors: Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai
- Abstract summary: We propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training).
Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation.
- Score: 77.80071279597665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To effectively exploit the potential of large-scale models, various
pre-training strategies supported by massive data from different sources are
proposed, including supervised pre-training, weakly-supervised pre-training,
and self-supervised pre-training. It has been shown that combining multiple
pre-training strategies and data from various modalities/sources can greatly
boost the training of large-scale models. However, current works adopt a
multi-stage pre-training system, whose complex pipeline may increase the
uncertainty and instability of pre-training. It is thus desirable to
integrate these strategies in a single stage. In this paper, we
first propose a general multi-modal mutual information formula as a unified
optimization target and demonstrate that all existing approaches are special
cases of our framework. Under this unified perspective, we propose an
all-in-one single-stage pre-training approach, named Maximizing Multi-modal
Mutual Information Pre-training (M3I Pre-training). Our approach achieves
better performance than previous pre-training methods on various vision
benchmarks, including ImageNet classification, COCO object detection, LVIS
long-tailed object detection, and ADE20k semantic segmentation. Notably, we
successfully pre-train a billion-level parameter image backbone and achieve
state-of-the-art performance on various benchmarks. Code shall be released at
https://github.com/OpenGVLab/M3I-Pretraining.
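As a rough illustration only (not the paper's exact formulation, which the abstract does not spell out), the kind of unified objective referred to above can be sketched as maximizing the mutual information between the representations of two training inputs drawn from different modalities or views:

\[
I(z_a; z_b) \;=\; \mathbb{E}_{p(z_a, z_b)}\!\left[ \log \frac{p(z_a, z_b)}{p(z_a)\, p(z_b)} \right]
\]

Here z_a and z_b are placeholder symbols (introduced for this sketch) for the encoded representations of the two inputs. Choosing the second input to be a class label, a paired caption, or another augmented view of the same image would recover supervised, weakly-supervised, or self-supervised pre-training respectively, in line with the abstract's claim that existing approaches are special cases of a single mutual-information target.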
Related papers
- Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs).
MIR is informative for training data selection, training strategy scheduling, and model architecture design to achieve better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Task-Oriented Pre-Training for Drivable Area Detection [5.57325257338134]
We propose a task-oriented pre-training method that begins with generating redundant segmentation proposals.
We then introduce a Specific Category Enhancement Fine-tuning (SCEF) strategy for fine-tuning the Contrastive Language-Image Pre-training (CLIP) model.
This approach can generate a large amount of coarse training data for pre-training models, which are further fine-tuned using manually annotated data.
arXiv Detail & Related papers (2024-09-30T10:25:47Z)
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
- Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition [10.36399200974439]
We introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach.
We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB.
arXiv Detail & Related papers (2024-03-28T20:23:39Z)
- Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performances are sub-optimal or even lag far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both the final performance and the sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z)