No Token Left Behind: Explainability-Aided Image Classification and
Generation
- URL: http://arxiv.org/abs/2204.04908v1
- Date: Mon, 11 Apr 2022 07:16:39 GMT
- Title: No Token Left Behind: Explainability-Aided Image Classification and
Generation
- Authors: Roni Paiss, Hila Chefer, Lior Wolf
- Abstract summary: We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
- Score: 79.4957965474334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of zero-shot learning in computer vision has been
revolutionized by the use of image-text matching models. The most notable
example, CLIP, has been widely used for both zero-shot classification and
guiding generative models with a text prompt. However, the zero-shot use of
CLIP is unstable with respect to the phrasing of the input text, making it
necessary to carefully engineer the prompts used. We find that this instability
stems from a selective similarity score, which is based only on a subset of the
semantically meaningful input tokens. To mitigate it, we present a novel
explainability-based approach, which adds a loss term to ensure that CLIP
focuses on all relevant semantic parts of the input, in addition to employing
the CLIP similarity loss used in previous works. When applied to one-shot
classification through prompt engineering, our method yields an improvement in
the recognition rate, without additional training or fine-tuning. Additionally,
we show that CLIP guidance of generative models using our method significantly
improves the generated images. Finally, we demonstrate a novel use of CLIP
guidance for text-based image generation with spatial conditioning on object
location, by requiring the image explainability heatmap for each object to be
confined to a pre-determined bounding box.
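As a rough illustration of the idea (a sketch of mine, not the authors' released code), the snippet below combines a standard CLIP similarity loss with an explainability-coverage term that penalizes semantic text tokens receiving low relevance, plus a term that confines an image relevance heatmap to a target bounding box for spatially conditioned generation. The relevance scores themselves are assumed to come from a separate transformer-explainability method and are treated as inputs here.

```python
import torch


def explainability_aided_loss(image_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              token_relevance: torch.Tensor,
                              semantic_mask: torch.Tensor,
                              lam: float = 1.0) -> torch.Tensor:
    """image_emb, text_emb: L2-normalized CLIP embeddings of shape (d,).
    token_relevance: (T,) relevance of each text token, scaled to [0, 1].
    semantic_mask: (T,) 1.0 for semantically meaningful tokens, else 0.0.
    """
    # Usual CLIP guidance term: maximize cosine similarity.
    similarity_loss = 1.0 - torch.dot(image_emb, text_emb)
    # Coverage term: penalize semantic tokens the model ignores,
    # so that no relevant token is "left behind".
    coverage_loss = ((1.0 - token_relevance) * semantic_mask).sum() \
        / semantic_mask.sum().clamp(min=1.0)
    return similarity_loss + lam * coverage_loss


def bbox_confinement_loss(heatmap: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
    """Spatial conditioning: penalize explainability heat that falls outside the
    object's bounding box. heatmap, box_mask: (H, W); box_mask is 1 inside the box."""
    outside_mass = (heatmap * (1.0 - box_mask)).sum()
    return outside_mass / heatmap.sum().clamp(min=1e-8)
```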
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
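A minimal sketch of the triplet idea summarized in the entry above (the names and the margin formulation are my assumptions, not the TripletCLIP code): each real (image, caption) pair is contrasted against a synthetic hard-negative caption and a generated negative image.

```python
import torch
import torch.nn.functional as F


def triplet_style_loss(img: torch.Tensor, txt: torch.Tensor,
                       neg_txt: torch.Tensor, neg_img: torch.Tensor,
                       margin: float = 0.2) -> torch.Tensor:
    """All inputs are L2-normalized CLIP embeddings of shape (batch, d)."""
    pos = (img * txt).sum(-1)        # true image-caption similarity
    neg_t = (img * neg_txt).sum(-1)  # image vs. hard negative caption
    neg_i = (neg_img * txt).sum(-1)  # generated negative image vs. caption
    # Hinge terms push the true pair above both negatives by a margin.
    return (F.relu(neg_t - pos + margin) + F.relu(neg_i - pos + margin)).mean()
```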
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings exhibit a greater degree of geometric structure in embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
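The following is a rough illustration of the difference-reasoning idea in the entry above (my own formulation, not the paper's exact objective): align the difference of two image embeddings with the embedding of a text describing that difference, using an InfoNCE-style loss.

```python
import torch
import torch.nn.functional as F


def difference_alignment_loss(img_a: torch.Tensor, img_b: torch.Tensor,
                              diff_text: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """img_a, img_b: (batch, d) image embeddings; diff_text: (batch, d)
    embeddings of texts describing how image A differs from image B."""
    diff = F.normalize(img_a - img_b, dim=-1)
    txt = F.normalize(diff_text, dim=-1)
    logits = diff @ txt.T / temperature                 # (batch, batch)
    targets = torch.arange(diff.size(0), device=diff.device)
    # Symmetric contrastive loss over matched (difference, description) pairs.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```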
- Semantic Compositions Enhance Vision-Language Contrastive Learning [46.985865191341944]
We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining.
Our method fuses the captions of two examples and blends 50% of each image to form a new composite sample.
The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
arXiv Detail & Related papers (2024-07-01T15:58:20Z)
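A simplified sketch of the composite-sample recipe summarized in the entry above (the exact CLIP-C construction may differ, e.g. in how captions are joined): blend two images 50/50 and fuse their captions to create an additional positive pair for contrastive pretraining.

```python
import torch


def make_composite(image_a: torch.Tensor, image_b: torch.Tensor,
                   caption_a: str, caption_b: str):
    """image_a, image_b: (3, H, W) tensors of the same size."""
    composite_image = 0.5 * image_a + 0.5 * image_b     # equal-weight pixel blend
    composite_caption = f"{caption_a} and {caption_b}"  # naive caption fusion
    return composite_image, composite_caption
```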
- Updating CLIP to Prefer Descriptions Over Captions [21.909877614471178]
We update the CLIP model to assign higher scores to descriptions than captions.
This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities.
arXiv Detail & Related papers (2024-06-12T20:24:51Z)
- Anchor-based Robust Finetuning of Vision-Language Models [46.87279531333293]
We aim at finetuning a vision-language model without hurting its out-of-distribution generalization.
We propose to compensate for the finetune process using auxiliary supervision with rich semantic information.
Our method achieves in-distribution performance akin to conventional finetuning.
arXiv Detail & Related papers (2024-04-09T12:10:54Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
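A minimal sketch of the self-distillation ingredient mentioned in the entry above (not the SILC objective itself; the MSE target is a stand-in of mine): the teacher is an exponential moving average of the student, and local patch features are distilled from teacher to student alongside the usual contrastive loss.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.996) -> None:
    """Teacher weights track the student as an exponential moving average."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)


def local_distillation_loss(student_patches: torch.Tensor,
                            teacher_patches: torch.Tensor) -> torch.Tensor:
    """student_patches, teacher_patches: (batch, num_patches, d) local features."""
    return F.mse_loss(student_patches, teacher_patches.detach())
```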
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
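A hedged sketch of the receiver-based evaluation described in the entry above, using the openai `clip` package as one possible listener; the candidate-crop setup and scoring rule are my assumptions.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def receiver_identifies_target(description: str, candidate_crops, target_idx: int) -> bool:
    """candidate_crops: list of PIL crops, one per candidate object in the scene."""
    images = torch.stack([preprocess(c) for c in candidate_crops]).to(device)
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(clip.tokenize([description]).to(device))
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)  # similarity per candidate
    return int(scores.argmax()) == target_idx
```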
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge amounts of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
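A hedged sketch of using CLIP similarity as a caption reward, as summarized in the entry above; the `clip` package and the plain cosine-similarity reward are assumptions of mine rather than the paper's exact reward.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def clip_reward(image, captions):
    """image: a PIL image; captions: list of candidate captions.
    Returns one cosine-similarity reward per caption."""
    image_emb = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    text_emb = model.encode_text(clip.tokenize(captions).to(device))
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (text_emb @ image_emb.T).squeeze(-1)  # shape: (num_captions,)
```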
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.