Boosting Visual-Language Models by Exploiting Hard Samples
- URL: http://arxiv.org/abs/2305.05208v2
- Date: Sun, 10 Mar 2024 14:00:53 GMT
- Title: Boosting Visual-Language Models by Exploiting Hard Samples
- Authors: Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu,
Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, Kenji Kawaguchi
- Abstract summary: HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
- Score: 126.35125029639168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has become the standard for
learning cross-modal representations between images and text. Efforts to
improve its capabilities typically demand the collection of additional data and
retraining with new loss functions. While effective, the added requirements
limit their practical use due to the increased resource and time investments
needed. In this work, we present HELIP, a cost-effective strategy tailored to
enhance the performance of existing CLIP models without the need for training a
model from scratch or collecting additional data. Our method allows for
effortless integration with existing models' training pipelines, providing an
instant boost by training them with selected challenging text-image pairs from
their original training datasets. HELIP treats each text-image pair as a single
point in the joint vision-language space, identifying those in close proximity
as hard pairs. By incorporating the challenging data, pre-trained CLIP models
are refined using both the traditional contrastive loss and the newly
introduced hard negative margin loss, ensuring the challenging data is fully
utilized. On comprehensive benchmarks, HELIP consistently boosts existing
models to achieve leading performance. In particular, it improves the zero-shot
classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M
and YFCC15M datasets. The improvements are 3.05%, 4.47%, and 10.1%
respectively, achieved within two epochs of training. In addition, across
fine-grained classification datasets, HELIP improves the zero-shot performance
of pre-trained CLIP and SLIP by an average of 8.4% and 18.6%, and their linear
probe performance by an average of 9.5% and 3.0%.
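A minimal sketch of the two ingredients described above: hard pair mining in the joint vision-language space and a hinge-style hard negative margin loss added on top of the usual contrastive objective. The concrete choices below (concatenating L2-normalized image and text embeddings as the joint representation, cosine-style similarity, the neighborhood size k, and the margin value) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mine_hard_pairs(img_emb, txt_emb, k=10):
    """Treat each text-image pair as one point in a joint space (here simply the
    concatenation of L2-normalized embeddings -- an assumption) and return, for
    every pair, the indices of its k nearest pairs as hard-pair candidates."""
    joint = torch.cat([F.normalize(img_emb, dim=-1),
                       F.normalize(txt_emb, dim=-1)], dim=-1)   # [N, 2D]
    sim = joint @ joint.t()                                     # [N, N] similarity
    sim.fill_diagonal_(float("-inf"))                           # exclude self-matches
    return sim.topk(k, dim=-1).indices                          # [N, k]

def hard_negative_margin_loss(img_emb, txt_emb, hard_idx, margin=0.2):
    """Hinge-style margin loss (illustrative): each matched pair should outscore
    its mined hard negatives by at least `margin`."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    pos = (img * txt).sum(dim=-1, keepdim=True)                 # [N, 1] matched scores
    neg = (img.unsqueeze(1) * txt[hard_idx]).sum(dim=-1)        # [N, k] hard-negative scores
    return F.relu(margin + neg - pos).mean()

# Toy usage with random features; in practice the embeddings would come from a
# pre-trained CLIP image/text encoder applied to the original training data.
img_emb, txt_emb = torch.randn(256, 512), torch.randn(256, 512)
hard_idx = mine_hard_pairs(img_emb, txt_emb, k=5)
loss = hard_negative_margin_loss(img_emb, txt_emb, hard_idx)
```

Per the abstract, a margin term of this kind is combined with the traditional contrastive loss while continuing to train the already pre-trained model on the mined hard pairs.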
Related papers
- Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity [11.414069074535007]
Contrastive Language-Image Pre-training on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization.
Identifying small subsets of training data that provably generalize best has remained an open question.
We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance.
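One way to picture that criterion, under illustrative assumptions rather than the paper's provable procedure: score a candidate subset by how closely its image-caption cross-covariance matches that of the full dataset.

```python
import torch

def cross_covariance(img_feats, txt_feats):
    """Cross-covariance between centered image and caption features."""
    img = img_feats - img_feats.mean(0)
    txt = txt_feats - txt_feats.mean(0)
    return img.t() @ txt / len(img)                    # [D, D]

def subset_score(idx, img_feats, txt_feats, full_cov):
    """Lower is better: Frobenius gap between the subset's cross-covariance and
    the full data's. How to search over subsets (greedy, sampling, ...) is left
    open here; the paper's selection procedure may differ."""
    sub_cov = cross_covariance(img_feats[idx], txt_feats[idx])
    return torch.linalg.norm(sub_cov - full_cov)
```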
arXiv Detail & Related papers (2024-03-18T21:32:58Z)
- A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
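For context, a minimal sketch of what applying GDA on top of frozen CLIP features can look like; the tied-covariance estimator, uniform class priors, and the regularizer `eps` below are assumptions, not necessarily the paper's exact recipe.

```python
import torch

def fit_gda(feats, labels, num_classes, eps=1e-4):
    """Fit per-class means and a shared (tied) covariance on frozen CLIP features,
    then return the weights/bias of the equivalent linear classifier."""
    d = feats.shape[1]
    means = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
    centered = feats - means[labels]                       # remove class means
    cov = centered.t() @ centered / len(feats) + eps * torch.eye(d)
    precision = torch.linalg.inv(cov)
    weights = means @ precision                            # [C, D]
    bias = -0.5 * (weights * means).sum(dim=1)             # [C], uniform priors assumed
    return weights, bias

def gda_predict(feats, weights, bias):
    """Classify features with the fitted linear discriminant."""
    return (feats @ weights.t() + bias).argmax(dim=-1)
```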
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution.
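A simplified way to picture the balancing step; the cap value and the matching format below are assumptions for illustration, not MetaCLIP's published settings.

```python
import random
from collections import defaultdict

def balance_over_metadata(pair_matches, cap=20, seed=0):
    """pair_matches: dict mapping pair_id -> list of matched metadata entries.
    Keep a pair only if at least one of its entries is still under `cap`, so
    over-represented (head) entries are down-sampled while tail entries survive.
    Simplified sketch of yielding a subset balanced over the metadata distribution."""
    random.seed(seed)
    counts = defaultdict(int)
    kept = []
    for pair_id in random.sample(list(pair_matches), len(pair_matches)):
        entries = pair_matches[pair_id]
        if any(counts[e] < cap for e in entries):
            kept.append(pair_id)
            for e in entries:
                counts[e] += 1
    return kept
```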
arXiv Detail & Related papers (2023-09-28T17:59:56Z)
- Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models [13.340759455910721]
We propose a novel method to prevent zero-shot transfer degradation in the continual learning of vision-language models.
Our method outperforms other methods in the traditional class-incremental learning setting.
arXiv Detail & Related papers (2023-03-12T10:28:07Z)
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaption capability, existing methods propose fine-tuning additional learnable modules.
We propose a training-free adaption method for CLIP to conduct few-shot classification, termed Tip-Adapter.
arXiv Detail & Related papers (2022-07-19T19:12:11Z)