CLIPPO: Image-and-Language Understanding from Pixels Only
- URL: http://arxiv.org/abs/2212.08045v2
- Date: Sat, 1 Apr 2023 21:01:36 GMT
- Title: CLIPPO: Image-and-Language Understanding from Pixels Only
- Authors: Michael Tschannen, Basil Mustafa, Neil Houlsby
- Abstract summary: We propose a pure pixel-based model to perform image, text, and multimodal tasks.
Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO).
When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks.
- Score: 36.433133689137875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal models are becoming increasingly effective, in part due to unified
components, such as the Transformer architecture. However, multimodal models
still often consist of many task- and modality-specific pieces and training
procedures. For example, CLIP (Radford et al., 2021) trains independent text
and image towers via a contrastive loss. We explore an additional unification:
the use of a pure pixel-based model to perform image, text, and multimodal
tasks. Our model is trained with contrastive loss alone, so we call it
CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both
regular images and text rendered as images. CLIPPO performs image-based tasks
such as retrieval and zero-shot image classification almost as well as
CLIP-style models, with half the number of parameters and no text-specific
tower or embedding. When trained jointly via image-text contrastive learning
and next-sentence contrastive learning, CLIPPO can perform well on natural
language understanding tasks, without any word-level loss (language modelling
or masked language modelling), outperforming pixel-based prior work.
Surprisingly, CLIPPO can obtain good accuracy in visual question answering,
simply by rendering the question and image together. Finally, we exploit the
fact that CLIPPO does not require a tokenizer to show that it can achieve
strong performance on multilingual multimodal retrieval without modifications.
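The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic symmetric InfoNCE-style loss in plain Python, and the function names (`l2_normalize`, `log_softmax`, `contrastive_loss`) and the temperature value of 0.07 are assumptions chosen for illustration. In a CLIPPO-style setup, both embedding lists would come from the same encoder, applied once to regular images and once to text rendered as images.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to form a matching pair.
    The temperature is an illustrative assumption, not a value taken
    from the abstract.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should assign most mass to column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
paired = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(paired, paired)
loss_swapped = contrastive_loss(paired, swapped)
```

In this toy batch, `loss_aligned` is close to zero because each embedding is most similar to its own partner, while `loss_swapped` is large because every pair is mismatched.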
Related papers
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Composed Image Retrieval using Contrastive Learning and Task-oriented
CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z) - Fooling Contrastive Language-Image Pre-trained Models with CLIPMasterPrints [15.643898659673036]
We show that despite their versatility, CLIP models are vulnerable to what we refer to as fooling master images.
Fooling master images are capable of maximizing the confidence score of a CLIP model for a significant number of widely varying prompts.
We demonstrate how fooling master images (CLIPMasterPrints) can be mined using gradient descent, projected gradient descent, or blackbox optimization.
arXiv Detail & Related papers (2023-07-07T18:54:11Z) - Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z) - S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist
Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z) - From Association to Generation: Text-only Captioning by Unsupervised
Cross-modal Mapping [20.67415815472257]
We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning.
arXiv Detail & Related papers (2023-04-26T04:06:20Z) - Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
The Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z) - ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.