MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- URL: http://arxiv.org/abs/2311.17049v2
- Date: Mon, 1 Apr 2024 13:06:06 GMT
- Title: MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- Authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel
- Abstract summary: We introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance.
MobileCLIP uses knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models.
Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
- Score: 17.158498267947877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3$\times$ faster while more accurate compared to previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on ViT-B/16 image backbone and achieving +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$ improved learning efficiency when compared with non-reinforced CLIP training. Code and models are available at https://github.com/apple/ml-mobileclip .
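A minimal PyTorch sketch of the idea described above, assuming the reinforced dataset supplies precomputed teacher-ensemble image/text embeddings for each pair (synthetic captions from the captioning model, also stored offline, are omitted). The function name, the KL-based similarity distillation, and the loss weight `lam` are illustrative assumptions rather than the paper's exact recipe; the point is that the teacher signal comes from stored tensors, so no teacher forward pass is needed at train time.

```python
import torch
import torch.nn.functional as F

def reinforced_clip_loss(img_s, txt_s, img_t, txt_t, temperature=0.07, lam=0.5):
    """img_s/txt_s: student embeddings [B, D].
    img_t/txt_t: teacher-ensemble embeddings [B, Dt] loaded from the reinforced
    dataset (precomputed offline, so no teacher compute is paid at train time)."""
    img_s, txt_s = F.normalize(img_s, dim=-1), F.normalize(txt_s, dim=-1)
    img_t, txt_t = F.normalize(img_t, dim=-1), F.normalize(txt_t, dim=-1)

    logits_s = img_s @ txt_s.t() / temperature   # student image-text similarities [B, B]
    logits_t = img_t @ txt_t.t() / temperature   # teacher similarities from stored embeddings

    # Standard CLIP contrastive loss against the ground-truth pairing.
    targets = torch.arange(img_s.size(0), device=img_s.device)
    contrastive = 0.5 * (F.cross_entropy(logits_s, targets) +
                         F.cross_entropy(logits_s.t(), targets))

    # Distillation: match the student's similarity distribution to the teacher's.
    distill = 0.5 * (F.kl_div(F.log_softmax(logits_s, dim=-1),
                              F.softmax(logits_t, dim=-1), reduction="batchmean") +
                     F.kl_div(F.log_softmax(logits_s.t(), dim=-1),
                              F.softmax(logits_t.t(), dim=-1), reduction="batchmean"))
    return (1 - lam) * contrastive + lam * distill
```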
Related papers
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
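As a rough, hypothetical illustration of cluster-complexity-based pruning (not the paper's exact criterion): cluster precomputed CLIP image embeddings, score each cluster by the average distance of its members to the centroid, and give more complex clusters a larger share of the keep-budget. The `embeddings` array, the complexity proxy, and the budget rule below are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_cluster_complexity(embeddings, keep_fraction=0.3, n_clusters=100, seed=0):
    """embeddings: [N, D] precomputed (e.g., CLIP) image embeddings.
    Returns indices of the samples to keep."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    labels, centers = km.labels_, km.cluster_centers_

    # Complexity proxy: mean distance of a cluster's members to its centroid.
    complexity = np.array([
        np.linalg.norm(embeddings[labels == c] - centers[c], axis=1).mean()
        for c in range(n_clusters)
    ])

    # Allocate the keep-budget proportionally to cluster complexity.
    total_budget = int(keep_fraction * len(embeddings))
    budgets = np.maximum(1, (total_budget * complexity / complexity.sum()).astype(int))

    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        keep.extend(rng.choice(members, size=min(budgets[c], len(members)), replace=False))
    return np.array(keep)
```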
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training [78.63699436330165]
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks.
Online data grow constantly, highlighting the importance of a pre-trained model's ability to learn from continuously growing data.
We propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input.
arXiv Detail & Related papers (2023-08-22T10:07:49Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
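A minimal sketch of the retrieve-then-fuse pattern described above: one lightweight transformer layer refines a frozen CLIP image embedding using the embeddings of the K items retrieved from a memory. The dimensions, query-token readout, and class/variable names are assumptions, not RECO's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalFusion(nn.Module):
    """Single-layer fusion module on top of frozen CLIP embeddings (sketch)."""
    def __init__(self, dim=512, nhead=8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)

    def forward(self, query_emb, retrieved_embs):
        """query_emb: [B, D] frozen CLIP image embedding.
        retrieved_embs: [B, K, D] embeddings of the K nearest memory items."""
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved_embs], dim=1)  # [B, 1+K, D]
        fused = self.layer(tokens)
        # Read out the refined query token and renormalize for cosine retrieval.
        return F.normalize(fused[:, 0], dim=-1)
```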
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- One-stop Training of Multiple Capacity Models [74.87789190840527]
We propose a novel one-stop training framework to jointly train high-capacity and low-capacity models.
Unlike knowledge distillation, where models of different capacities are trained from scratch separately, our approach integrates supervision from the different capacity models simultaneously.
arXiv Detail & Related papers (2023-05-23T13:44:09Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement [24.108008515395458]
We propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency.
For average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art results, outperforming the second-best by +1.59% and +1.99% respectively under 16 shots with 30× fewer learnable parameters.
arXiv Detail & Related papers (2023-04-03T17:58:54Z)
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaption capability, existing methods proposed to fine-tune additional learnable modules.
We propose a training-free adaption method for CLIP to conduct few-shot classification, termed Tip-Adapter.
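A short sketch in the spirit of a training-free, cache-based adapter: few-shot image features serve as keys, their one-hot labels as values, and the cache logits are blended with CLIP's zero-shot logits. The exponential affinity and the `alpha`/`beta` hyperparameters follow the commonly cited formulation, but treat this as an illustration rather than the paper's reference implementation.

```python
import torch

def cache_adapter_logits(test_feat, cache_keys, cache_values, clip_text_weights,
                         alpha=1.0, beta=5.5):
    """test_feat: [B, D] L2-normalized image features.
    cache_keys: [N*K, D] normalized few-shot image features (the cache keys).
    cache_values: [N*K, C] one-hot labels of the cached shots.
    clip_text_weights: [D, C] normalized class text embeddings."""
    zero_shot = 100.0 * test_feat @ clip_text_weights        # CLIP zero-shot logits
    affinity = test_feat @ cache_keys.t()                    # cosine similarity to the cache
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values.float()
    return zero_shot + alpha * cache_logits                  # blended prediction
```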
arXiv Detail & Related papers (2022-07-19T19:12:11Z)
- Personalizing Pre-trained Models [23.145974171912414]
We consider how upstream pretrained models can be leveraged for downstream few-shot, multilabel, and continual learning tasks.
Our model CLIPPER (CLIP PERsonalized) uses image representations from CLIP, a large-scale image representation learning model trained using weak natural language supervision.
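The general pattern the summary describes, keeping the CLIP image encoder frozen and training only a small head on its features, can be sketched as below; the multilabel linear probe, feature dimension, and training objective are generic assumptions rather than CLIPPER's exact configuration.

```python
import torch
import torch.nn as nn

class FrozenCLIPProbe(nn.Module):
    """Lightweight multilabel head on top of a frozen CLIP image encoder (sketch)."""
    def __init__(self, clip_image_encoder, feat_dim=512, num_labels=20):
        super().__init__()
        self.encoder = clip_image_encoder.eval()
        for p in self.encoder.parameters():       # freeze the backbone
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, num_labels)

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)          # [B, feat_dim] CLIP image features
        return self.head(feats)                   # multilabel logits; train with BCEWithLogitsLoss
```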
arXiv Detail & Related papers (2021-06-02T22:58:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.