PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts
- URL: http://arxiv.org/abs/2308.01313v3
- Date: Mon, 18 Mar 2024 16:02:10 GMT
- Title: PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts
- Authors: Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang
- Abstract summary: This paper draws inspiration from the human visual perception process.
We propose a training-free, two-step zero-shot classification method PerceptionCLIP.
Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability.
- Score: 33.109305627550405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models like CLIP are widely used in zero-shot image classification due to their ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better performance is still an open question. This paper draws inspiration from the human visual perception process: when classifying an object, humans first infer contextual attributes (e.g., background and orientation) which help separate the foreground object from the background, and then classify the object based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot image classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer the attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method, PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. Our code is available at https://github.com/umd-huang-lab/perceptionCLIP
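For concreteness, below is a minimal sketch of the two-step idea described in the abstract, written against the OpenAI `clip` package. The attribute list (`backgrounds`), the class names, the prompt templates, and the marginalization over the inferred attribute are illustrative assumptions, not the paper's exact implementation; see the linked repository for the authors' code.

```python
# Sketch: infer a contextual attribute first, then classify conditioned on it.
# Assumes `pip install git+https://github.com/openai/CLIP.git` and torch/Pillow.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["dog", "cat", "bird"]                    # hypothetical label set
backgrounds = ["indoors", "outdoors", "on grass"]   # hypothetical attribute values


@torch.no_grad()
def text_features(prompts):
    """Encode a list of prompts and L2-normalize the embeddings."""
    tokens = clip.tokenize(prompts).to(device)
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)


@torch.no_grad()
def two_step_classify(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img = model.encode_image(image)
    img = img / img.norm(dim=-1, keepdim=True)

    # Step 1: infer the contextual attribute (here: background) from the image.
    bg_feats = text_features([f"a photo taken {b}" for b in backgrounds])
    bg_probs = (100.0 * img @ bg_feats.T).softmax(dim=-1)   # shape (1, |backgrounds|)

    # Step 2: classify the object with prompts conditioned on each attribute
    # value, marginalizing over the inferred attribute distribution.
    class_probs = torch.zeros(1, len(classes), device=device)
    for j, b in enumerate(backgrounds):
        cls_feats = text_features([f"a photo of a {c} {b}" for c in classes])
        cond = (100.0 * img @ cls_feats.T).softmax(dim=-1)
        class_probs += bg_probs[:, j : j + 1] * cond

    return classes[class_probs.argmax(dim=-1).item()]
```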
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning, together with corresponding negative images from text-to-image generators, offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Interpreting and Analyzing CLIP's Zero-Shot Image Classification via Mutual Knowledge [20.09852220432504]
Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space.
This work provides a new approach for interpreting CLIP models for image classification from the lens of mutual knowledge between the two modalities.
arXiv Detail & Related papers (2024-10-16T20:18:21Z)
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings satisfy a greater degree of desirable geometric properties in the embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos [19.08882495584709]
We show that it is possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties.
We adapt CLIP efficiently on a high-quality, comprehensive, and relatively small dataset.
We learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented.
arXiv Detail & Related papers (2024-01-15T13:27:34Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments (see the prompt-pairing sketch after this list).
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification [7.6146285961466]
Ours is one of the first methods to use CLIP (Contrastive Language-Image Pre-training) to train a neural network on a variety of artwork image and text description pairs.
Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition.
On this benchmark, we achieved competitive results using only self-supervision.
arXiv Detail & Related papers (2022-04-29T17:17:24Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment [102.17010696898113]
We show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language.
We propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task.
arXiv Detail & Related papers (2022-03-14T15:29:27Z)
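As referenced in the "Exploring CLIP for Assessing the Look and Feel of Images" entry above, zero-shot perceptual assessment can be sketched with an antonym prompt pair. The specific prompts, the softmax scoring, and the helper below are illustrative assumptions rather than that paper's exact formulation.

```python
# Sketch: score image quality zero-shot by comparing an image against an
# antonym prompt pair. Assumes the OpenAI `clip` package is installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def quality_score(image_path):
    """Return a [0, 1] score (higher = better perceived quality)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize(["Good photo.", "Bad photo."]).to(device)

    img = model.encode_image(image)
    txt = model.encode_text(prompts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)

    # Softmax over the two antonym prompts; the probability assigned to the
    # positive prompt serves as the quality score.
    probs = (100.0 * img @ txt.T).softmax(dim=-1)
    return probs[0, 0].item()
```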