CyCLIP: Cyclic Contrastive Language-Image Pretraining
- URL: http://arxiv.org/abs/2205.14459v1
- Date: Sat, 28 May 2022 15:31:17 GMT
- Title: CyCLIP: Cyclic Contrastive Language-Image Pretraining
- Authors: Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan A. Rossi, Vishwa Vinay, Aditya Grover
- Abstract summary: Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text space.
- Score: 34.588147979731374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in contrastive representation learning over paired image-text
data have led to models such as CLIP that achieve state-of-the-art performance
for zero-shot classification and distributional robustness. Such models
typically require joint reasoning in the image and text representation spaces
for downstream inference tasks. Contrary to prior beliefs, we demonstrate that
the image and text representations learned via a standard contrastive objective
are not interchangeable and can lead to inconsistent downstream predictions. To
mitigate this issue, we formalize consistency and propose CyCLIP, a framework
for contrastive representation learning that explicitly optimizes for the
learned representations to be geometrically consistent in the image and text
space. In particular, we show that consistent representations can be learned by
explicitly symmetrizing (a) the similarity between the two mismatched
image-text pairs (cross-modal consistency); and (b) the similarity between the
image-image pair and the text-text pair (in-modal consistency). Empirically, we
show that the improved consistency in CyCLIP translates to significant gains
over CLIP, with gains ranging from 10%-24% for zero-shot classification
accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27%
for robustness to various natural distribution shifts. The code is available at
https://github.com/goel-shashank/CyCLIP.
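The two consistency terms in (a) and (b) can be written directly in terms of pairwise similarities within a batch. The following is a minimal PyTorch-style sketch, not the released implementation: the function name, the mean-squared-difference scaling, and the loss weights lambda_1 and lambda_2 are illustrative assumptions, and the embeddings are assumed to be L2-normalized as in CLIP.

```python
import torch


def cyclip_consistency_losses(image_emb: torch.Tensor, text_emb: torch.Tensor):
    """Cross-modal and in-modal consistency regularizers (sketch).

    image_emb, text_emb: (N, d) batches of L2-normalized embeddings for
    N matched image-text pairs.
    """
    sim_it = image_emb @ text_emb.t()   # sim_it[j, k] = <I_j, T_k>
    sim_ii = image_emb @ image_emb.t()  # <I_j, I_k>
    sim_tt = text_emb @ text_emb.t()    # <T_j, T_k>

    # (a) cross-modal consistency: the similarity of image j with text k
    #     should match the similarity of image k with text j.
    cross_modal = ((sim_it - sim_it.t()) ** 2).mean()

    # (b) in-modal consistency: image-image similarities should match the
    #     corresponding text-text similarities.
    in_modal = ((sim_ii - sim_tt) ** 2).mean()

    return cross_modal, in_modal


# In training, these terms would be added to the usual symmetric CLIP
# contrastive loss with tunable weights (lambda_1, lambda_2 are placeholders):
# total_loss = clip_contrastive_loss + lambda_1 * cross_modal + lambda_2 * in_modal
```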
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings satisfy more geometric properties in the embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- Semantic Compositions Enhance Vision-Language Contrastive Learning [46.985865191341944]
We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining.
Our method, CLIP-C, fuses the captions and blends 50% of each image to form a new composite sample.
The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
arXiv Detail & Related papers (2024-07-01T15:58:20Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-samples and multi-level representations.
Our method, termed CsMl, can integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.