CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
- URL: http://arxiv.org/abs/2405.19547v2
- Date: Fri, 20 Dec 2024 04:54:04 GMT
- Title: CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
- Authors: Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin Jamieson, Simon Shaolei Du
- Abstract summary: Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP)
Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data, and (3) designing better metrics or strategies universally applicable to any CLIP embedding.
- Score: 19.100022935748225
- Abstract: Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. First, instead of classical CLIP scores that only consider the alignment between the two modalities of a single sample, we introduce surrogate-CLIPLoss (s-CLIPLoss), a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Second, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark DataComp (Gadre et al., 2023). Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both s-CLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods, DFN and HYPE, we can boost average performance on downstream tasks by 0.9%, achieving a new state-of-the-art on the DataComp-medium benchmark.
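To make the two metrics concrete, the sketch below scores candidate image-text pairs from precomputed, L2-normalized CLIP embeddings. It is a minimal sketch, not the authors' reference implementation: the in-batch normalization, temperature value, function names, and the clipping of negative similarities are illustrative assumptions consistent with the abstract's description.

```python
# Minimal sketch of s-CLIPLoss- and NormSim-style scoring from precomputed,
# L2-normalized CLIP embeddings. Batch construction, temperature, and the
# handling of negative similarities are illustrative assumptions.
import numpy as np
from scipy.special import logsumexp


def clip_score(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """Classical CLIPScore: cosine alignment of each image-text pair alone."""
    return np.sum(img_emb * txt_emb, axis=-1)


def s_cliploss_score(img_emb: np.ndarray, txt_emb: np.ndarray,
                     temperature: float = 0.01) -> np.ndarray:
    """CLIP-loss-inspired score: each pair's alignment is normalized by its
    alignment with the other (contrastive) pairs in the same batch, i.e. the
    negative symmetric InfoNCE loss per pair (higher = better quality)."""
    logits = img_emb @ txt_emb.T / temperature            # (B, B) similarities
    diag = np.diag(logits)                                # matched-pair logits
    log_p_img2txt = diag - logsumexp(logits, axis=1)      # image -> text
    log_p_txt2img = diag - logsumexp(logits, axis=0)      # text  -> image
    return 0.5 * (log_p_img2txt + log_p_txt2img)


def normsim(img_emb: np.ndarray, target_emb: np.ndarray,
            p: float = np.inf) -> np.ndarray:
    """NormSim-style metric: p-norm of a candidate image's similarity to a
    known downstream/target set; p = inf reduces to the max similarity."""
    sims = img_emb @ target_emb.T                         # (B, N_target)
    if np.isinf(p):
        return sims.max(axis=1)
    return (np.clip(sims, 0.0, None) ** p).sum(axis=1) ** (1.0 / p)
```

Under this sketch, data selection amounts to ranking the candidate pool by one of these scores (or a combination) and keeping the top fraction; the exact subset sizes and the combination with DFN and HYPE follow the paper.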
Related papers
- Rethinking Few-shot 3D Point Cloud Semantic Segmentation [62.80639841429669]
This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS).
We focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution.
To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built.
arXiv Detail & Related papers (2024-03-01T15:14:47Z) - DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z) - Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data [122.282521548393]
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning.
We introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training.
arXiv Detail & Related papers (2023-05-09T07:00:17Z) - DataComp: In search of the next generation of multimodal datasets [179.79323076587255]
DataComp is a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Our benchmark consists of multiple compute scales spanning four orders of magnitude.
In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet.
arXiv Detail & Related papers (2023-04-27T11:37:18Z) - Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement [24.108008515395458]
We propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency.
For the average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art and respectively outperform the second-best by +1.59% and +1.99% under 16 shots with x30 less learnable parameters.
arXiv Detail & Related papers (2023-04-03T17:58:54Z) - Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt [103.58323875748427]
This work offers a novel unsupervised pre-training solution for low-data regimes.
Inspired by the recent success of prompting techniques, we introduce a new pre-training method that boosts QEIS models.
Experimental results show that our method significantly boosts several QEIS models on three datasets.
arXiv Detail & Related papers (2023-02-02T15:49:03Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP to downstream tasks undesirably degrades out-of-distribution (OOD) performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Deep Active Ensemble Sampling For Image Classification [8.31483061185317]
Active learning frameworks aim to reduce the cost of data annotation by actively requesting the labeling for the most informative data points.
Proposed approaches include uncertainty-based techniques, geometric methods, and implicit combinations of the two.
We present an innovative integration of recent progress in both uncertainty-based and geometric frameworks to enable an efficient exploration/exploitation trade-off in sample selection strategy.
Our framework provides two advantages: (1) accurate posterior estimation, and (2) a tunable trade-off between computational overhead and accuracy.
arXiv Detail & Related papers (2022-10-11T20:20:20Z)