CLIPPO: Image-and-Language Understanding from Pixels Only
- URL: http://arxiv.org/abs/2212.08045v2
- Date: Sat, 1 Apr 2023 21:01:36 GMT
- Title: CLIPPO: Image-and-Language Understanding from Pixels Only
- Authors: Michael Tschannen, Basil Mustafa, Neil Houlsby
- Abstract summary: We propose a pure pixel-based model to perform image, text, and multimodal tasks.
Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO).
When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks.
- Score: 36.433133689137875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal models are becoming increasingly effective, in part due to unified
components, such as the Transformer architecture. However, multimodal models
still often consist of many task- and modality-specific pieces and training
procedures. For example, CLIP (Radford et al., 2021) trains independent text
and image towers via a contrastive loss. We explore an additional unification:
the use of a pure pixel-based model to perform image, text, and multimodal
tasks. Our model is trained with contrastive loss alone, so we call it
CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both
regular images and text rendered as images. CLIPPO performs image-based tasks
such as retrieval and zero-shot image classification almost as well as
CLIP-style models, with half the number of parameters and no text-specific
tower or embedding. When trained jointly via image-text contrastive learning
and next-sentence contrastive learning, CLIPPO can perform well on natural
language understanding tasks, without any word-level loss (language modelling
or masked language modelling), outperforming pixel-based prior work.
Surprisingly, CLIPPO can obtain good accuracy in visual question answering,
simply by rendering the question and image together. Finally, we exploit the
fact that CLIPPO does not require a tokenizer to show that it can achieve
strong performance on multilingual multimodal retrieval without modifications.
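To make the single-encoder design concrete, below is a minimal sketch of a CLIPPO-style contrastive step, assuming PyTorch: text is simply rasterized onto a blank canvas and pushed through the same image encoder as the regular images, and the two sets of embeddings are tied together with a symmetric contrastive loss. The `render_text_as_image` helper, the `encoder` argument, and the temperature value are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of a CLIPPO-style training step (illustrative, not the authors' code).
# One shared ViT-style encoder embeds both regular images and text rendered as images.
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image, ImageDraw


def render_text_as_image(text: str, size: int = 224) -> torch.Tensor:
    """Rasterize a string onto a blank canvas so the image encoder can consume it."""
    canvas = Image.new("RGB", (size, size), "white")
    ImageDraw.Draw(canvas).text((4, 4), text, fill="black")
    pixels = torch.from_numpy(np.array(canvas)).float() / 255.0   # (H, W, 3)
    return pixels.permute(2, 0, 1)                                # (3, H, W)


def clippo_contrastive_loss(encoder, images, texts, temperature=0.07):
    """Symmetric image/text contrastive loss with a single shared encoder."""
    text_images = torch.stack([render_text_as_image(t) for t in texts])
    img_emb = F.normalize(encoder(images), dim=-1)       # embed regular images
    txt_emb = F.normalize(encoder(text_images), dim=-1)  # embed rendered text with the SAME encoder
    logits = img_emb @ txt_emb.t() / temperature         # pairwise similarities
    targets = torch.arange(len(images), device=logits.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

The language-only objective described in the abstract (contrasting consecutive sentences) can reuse essentially the same loss, with both sides of each pair being rendered text.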
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints [15.643898659673036]
We show that despite their versatility, CLIP models are vulnerable to what we refer to as fooling master images.
Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts.
We demonstrate how fooling master images for CLIP (CLIPMasterPrints) can be mined using gradient descent, projected gradient descent, or blackbox optimization; a toy gradient-ascent sketch appears after this list.
arXiv Detail & Related papers (2023-07-07T18:54:11Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z)
- From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping [20.67415815472257]
We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning.
arXiv Detail & Related papers (2023-04-26T04:06:20Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
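As mentioned in the CLIPMasterPrints entry above, the following toy sketch illustrates mining a fooling master image by gradient ascent: a single image tensor is optimized to maximize its mean CLIP similarity over a set of prompts. The open_clip package, the ViT-B-32 checkpoint, the prompt list, and the optimizer settings are all assumptions made for illustration; a faithful reproduction would follow the paper's setup, including CLIP's input preprocessing and the projected gradient descent or blackbox optimization variants.

```python
# Toy sketch of mining a "fooling master image" by gradient ascent (illustrative only).
# open_clip, the checkpoint, prompts, and optimizer settings are assumed placeholders.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()
for p in model.parameters():          # freeze CLIP; only the image pixels are optimized
    p.requires_grad_(False)

prompts = ["a photo of a dog", "a photo of a car", "a photo of a pizza"]  # placeholder prompts
with torch.no_grad():
    text_emb = model.encode_text(tokenizer(prompts))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    img_emb = model.encode_image(image.clamp(0, 1))       # real runs also apply CLIP's normalization
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb @ text_emb.t()).mean()               # maximize mean similarity to every prompt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```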
This list is automatically generated from the titles and abstracts of the papers on this site.