CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
- URL: http://arxiv.org/abs/2502.00965v1
- Date: Mon, 03 Feb 2025 00:04:50 GMT
- Title: CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
- Authors: Xinze Wang, Chen Chen, Yinfei Yang, Hong-You Chen, Bowen Zhang, Aditya Pal, Xiangxin Zhu, Xianzhi Du,
- Abstract summary: Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs.
We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture.
- Score: 21.65268178160724
- License:
- Abstract: Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
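The core upcycling step described in the abstract — replacing each dense MLP block with a sparse MoE layer whose experts are initialized from the pre-trained dense weights, while only a lightweight router is trained from scratch — can be illustrated with a minimal sketch. The code below is not the paper's released implementation: the class and argument names (`UpcycledMoEMLP`, `num_experts`, `top_k`) are hypothetical, and the routing follows a generic top-k token-choice scheme assumed for illustration.

```python
import copy

import torch
import torch.nn as nn


class UpcycledMoEMLP(nn.Module):
    """Sparse MoE MLP whose experts are all initialized from a pre-trained dense MLP.

    Hypothetical sketch of sparse upcycling; names and routing details are assumptions.
    """

    def __init__(self, dense_mlp: nn.Module, d_model: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # "Upcycling": every expert starts as an exact copy of the dense MLP weights.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        # The router is the only component initialized from scratch.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten batch and sequence dimensions before calling.
        gate_probs = self.router(x).softmax(dim=-1)             # (tokens, experts)
        weights, indices = gate_probs.topk(self.top_k, dim=-1)  # route each token to its top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Example: upcycle a dense MLP block (e.g. from a CLIP transformer layer) into an MoE layer.
d_model = 768
dense_mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
moe_mlp = UpcycledMoEMLP(dense_mlp, d_model=d_model, num_experts=4, top_k=2)
tokens = torch.randn(32, d_model)
assert moe_mlp(tokens).shape == tokens.shape
```

Because every expert starts as a copy of the dense MLP and the routing weights are renormalized to sum to one, this layer reproduces the dense model's outputs exactly at initialization, which is what makes the upcycled MoE cheap to warm-start relative to training a sparse model from scratch.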
Related papers
- Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones.
In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining.
Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z)
- CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling [21.734200158914476]
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence.
Diversified Multiplet Upcycling (DMU) efficiently fine-tunes a series of CLIP models that capture different feature spaces.
Experiments demonstrate the strong performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks.
arXiv Detail & Related papers (2024-09-28T09:28:51Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies [27.809995478990544]
This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets.
We show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality.
We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resource.
arXiv Detail & Related papers (2024-04-12T02:04:34Z)
- MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric [57.3330687266266]
We find that using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance.
Using the Module-wise Pruning Error (MoPE) metric, we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages.
arXiv Detail & Related papers (2024-03-12T17:24:26Z)
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training [17.158498267947877]
We introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance.
MobileCLIP uses knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models.
Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
arXiv Detail & Related papers (2023-11-28T18:55:42Z)
- Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution.
arXiv Detail & Related papers (2023-09-28T17:59:56Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP to downstream tasks undesirably degrades out-of-distribution (OOD) performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset, respectively.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- Do Pre-trained Models Benefit Equally in Continual Learning? [25.959813589169176]
Existing work on continual learning (CL) is primarily devoted to developing algorithms for models trained from scratch.
Despite their encouraging performance on contrived benchmarks, these algorithms show dramatic performance drops in real-world scenarios.
This paper advocates the systematic introduction of pre-training to CL.
arXiv Detail & Related papers (2022-10-27T18:03:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.