Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks
- URL: http://arxiv.org/abs/2502.18326v1
- Date: Mon, 17 Feb 2025 16:52:02 GMT
- Title: Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks
- Authors: Thaddäus Wiedemer, Yash Sharma, Ameya Prabhu, Matthias Bethge, Wieland Brendel
- Abstract summary: We show that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly. For data curation in practice, our results suggest that balancing object occurrences improves generalization.
- Score: 25.94703673275305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on individual concepts. This sample-inefficient scaling could be mitigated if CLIP systematically understood new inputs as compositions of learned components, allowing rare observations to be mapped to common concepts. To explore CLIP's compositional generalization ability, we filter retrieval corpora for samples with object combinations not present in the pretraining corpus. We show that CLIP's performance on these samples can be accurately predicted from the pretraining frequencies of individual objects. Our findings demonstrate that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly. Additionally, we are the first to show how this ability scales with pretraining data. For data curation in practice, our results suggest that balancing object occurrences improves generalization, which should benefit CLIP's efficiency and accuracy without scaling data volume.
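The abstract describes predicting CLIP's retrieval accuracy on samples whose object combinations never co-occur in pretraining, using only the pretraining frequencies of the individual objects. Below is a minimal illustrative sketch of such a frequency-based predictor; the counts, accuracies, the log-linear form, and the names (object_freq, pair_accuracy) are assumptions for illustration, not the paper's data or fitting procedure.

```python
# Hedged sketch (not the authors' code): predict per-sample retrieval accuracy on
# images containing an object pair never seen together in pretraining, using only
# the pretraining frequencies of the individual objects.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical pretraining counts for individual objects (e.g. from caption matching).
object_freq = {"dog": 1_200_000, "surfboard": 80_000, "accordion": 5_000, "canoe": 20_000}

# Hypothetical retrieval accuracy of CLIP on test samples showing unseen object pairs.
pair_accuracy = {("dog", "surfboard"): 0.71, ("dog", "accordion"): 0.48,
                 ("accordion", "canoe"): 0.33, ("surfboard", "canoe"): 0.55}

# Feature: combine the two single-object log-frequencies (exponential data needs for
# linear gains suggest a log scale); here the mean log-frequency is the predictor.
X = np.array([[np.mean([np.log10(object_freq[a]), np.log10(object_freq[b])])]
              for (a, b) in pair_accuracy])
y = np.array(list(pair_accuracy.values()))

model = LinearRegression().fit(X, y)  # accuracy ≈ slope * mean log-frequency + intercept
print("R^2 on toy data:", model.score(X, y))
print("Predicted accuracy at mean log10 frequency 5:", model.predict([[5.0]])[0])
```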
Related papers
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations.
SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z) - Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models [3.9326597037266455]
Compositional Out of Distribution (C-OoD) generalization is relatively unexplored for CLIP models.
Our study reveals that the disentanglement of image and text representations, particularly with respect to their compositional elements, plays a crucial role in improving the generalization of CLIP models.
arXiv Detail & Related papers (2024-07-08T13:04:40Z) - What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights [67.72413262980272]
Severe data imbalance naturally exists among web-scale vision-language datasets.
We find that CLIP pre-trained on such data exhibits notable robustness to the imbalance compared to supervised learning.
The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
arXiv Detail & Related papers (2024-05-31T17:57:24Z) - Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations [19.800907485589402]
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks.
These tuned models tend to become highly specialized, limiting their practicality for real-world deployment.
We propose a lightweight representation calibration method for fine-tuning CLIP.
arXiv Detail & Related papers (2024-03-12T01:47:17Z) - Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision. We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP), which takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution; a minimal balancing sketch is given after this list.
arXiv Detail & Related papers (2023-09-28T17:59:56Z) - Understanding In-Context Learning via Supportive Pretraining Data [55.648777340129364]
In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time.
It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations.
Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data.
arXiv Detail & Related papers (2023-06-26T22:14:04Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adaptation of CLIP to downstream tasks undesirably degrades its out-of-distribution (OOD) performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z) - Foundational Models for Continual Learning: An Empirical Study of Latent Replay [17.322679682451597]
We study the efficacy of pre-trained vision models as a foundation for downstream continual learning scenarios.
We compare the efficacy of various pre-trained models in large-scale benchmarking scenarios, with a vanilla replay setting applied in both the latent and the raw-data space.
arXiv Detail & Related papers (2022-04-30T19:11:37Z) - The CLEAR Benchmark: Continual LEArning on Real-World Imagery [77.98377088698984]
Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI.
We introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts.
We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms.
arXiv Detail & Related papers (2022-01-17T09:09:09Z)
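As referenced in the MetaCLIP entry above, curation there yields a subset balanced over a metadata distribution. The sketch below shows one way such per-entry capping could work; the matching rule, cap value, and names (balanced_subset, cap_per_entry) are assumptions for illustration, not MetaCLIP's actual procedure.

```python
# Hedged sketch (not the MetaCLIP implementation): balance a raw image-text pool over
# a set of metadata entries by capping how many captions each entry may contribute.
import random
from collections import defaultdict

def balanced_subset(pool, metadata_entries, cap_per_entry, seed=0):
    """pool: list of (image_id, caption); metadata_entries: iterable of concept strings."""
    rng = random.Random(seed)
    counts = defaultdict(int)   # kept samples already matching each entry
    kept = []
    shuffled = pool[:]          # shuffle so the cap does not favor early samples
    rng.shuffle(shuffled)
    for image_id, caption in shuffled:
        matched = [e for e in metadata_entries if e in caption.lower()]
        if not matched:
            continue            # discard samples matching no metadata entry
        # Keep the sample only if some matched entry is still under its cap; frequent
        # (head) entries fill up quickly, so rare (tail) entries are kept relatively more.
        if any(counts[e] < cap_per_entry for e in matched):
            kept.append((image_id, caption))
            for e in matched:
                counts[e] += 1
    return kept

# Toy usage with hypothetical captions and concepts.
pool = list(enumerate(["a photo of a dog", "a dog on a couch",
                       "an accordion player", "a dog in the park"]))
print(balanced_subset(pool, {"dog", "accordion"}, cap_per_entry=2))
```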