GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive
Language-Image Pre-training
- URL: http://arxiv.org/abs/2308.11331v1
- Date: Tue, 22 Aug 2023 10:07:49 GMT
- Title: GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive
Language-Image Pre-training
- Authors: Xinchi Deng, Han Shi, Runhui Huang, Changlin Li, Hang Xu, Jianhua Han,
James Kwok, Shen Zhao, Wei Zhang, Xiaodan Liang
- Abstract summary: Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks.
Online data grow constantly, which highlights the importance of pre-trained models being able to learn from continuously growing data.
We propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input.
- Score: 78.63699436330165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal pre-training has shown impressive performance on a wide range of
downstream tasks, benefiting from massive image-text pairs collected from the
Internet. In practice, online data grow constantly, which highlights the importance of
pre-trained models being able to learn from continuously growing data. Existing works on
cross-modal pre-training mainly focus on training a network with a fixed architecture.
However, it is impractical to limit the model capacity when considering the continuously
growing nature of pre-training data in real-world applications. On the other hand, it is
important to utilize the knowledge in the current model to obtain efficient training and
better performance. To address these issues, in this paper we propose GrowCLIP, a
data-driven automatic model growing algorithm for contrastive language-image pre-training
with continuous image-text pairs as input. Specifically, we adopt a dynamic growth space
and search for the optimal architecture at each growth step to adapt to online learning
scenarios. A shared encoder is also proposed in our growth space to enhance the degree of
cross-modal fusion. In addition, we explore the effect of growth along different
dimensions, which could serve as a reference for the future design of cross-modal model
architectures. Finally, we employ parameter inheriting with momentum (PIM) to preserve
previous knowledge and address the local-minimum dilemma. Compared with existing methods,
GrowCLIP improves average top-1 accuracy on zero-shot image classification across 9
downstream tasks by 2.3%. For zero-shot image retrieval, GrowCLIP improves top-1
image-to-text recall on the Flickr30K dataset by 1.2%.
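The abstract's central mechanism, parameter inheriting with momentum (PIM), initializes the grown model from the previous one rather than from scratch before contrastive pre-training continues on newly arrived image-text pairs. The sketch below is a minimal, hypothetical PyTorch illustration of that idea for a single linear layer being widened at a growth step; the function name, the momentum formulation, and the width-expansion scheme are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


def grow_linear_with_momentum(old_layer: nn.Linear, new_in: int, new_out: int,
                              momentum: float = 0.9) -> nn.Linear:
    """Widen a linear layer at a growth step, inheriting old weights with momentum.

    Hypothetical sketch: the overlapping block of the new weight matrix becomes a
    momentum-weighted blend of the inherited weights and its fresh random
    initialization; the newly added rows/columns keep their random initialization.
    """
    assert new_in >= old_layer.in_features and new_out >= old_layer.out_features
    new_layer = nn.Linear(new_in, new_out)
    with torch.no_grad():
        o, i = old_layer.weight.shape  # (out_features, in_features)
        new_layer.weight[:o, :i].mul_(1.0 - momentum).add_(momentum * old_layer.weight)
        new_layer.bias[:o].mul_(1.0 - momentum).add_(momentum * old_layer.bias)
    return new_layer


# Example: widen a 512-d projection to 768 output units when new image-text pairs arrive.
old_proj = nn.Linear(512, 512)
grown_proj = grow_linear_with_momentum(old_proj, new_in=512, new_out=768)
print(grown_proj.weight.shape)  # torch.Size([768, 512])
```

In an online setting, a step like this would be applied to every layer selected at a growth step before resuming contrastive training on the new image-text pairs.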
Related papers
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training [17.158498267947877]
We introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance.
MobileCLIP uses knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models.
Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
arXiv Detail & Related papers (2023-11-28T18:55:42Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embeddings with cross-modal information retrieved from a memory at inference time.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot
Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models for pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z) - Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z) - Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z) - Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model could be learned with a prefix language modeling objective upon text and image sequences.
Thanks to the simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)