ECOR: Explainable CLIP for Object Recognition
- URL: http://arxiv.org/abs/2404.12839v1
- Date: Fri, 19 Apr 2024 12:20:49 GMT
- Title: ECOR: Explainable CLIP for Object Recognition
- Authors: Ali Rasekh, Sepehr Kazemi Ranjbar, Milad Heidari, Wolfgang Nejdl
- Abstract summary: We propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales.
Our method demonstrates state-of-the-art performance in explainable classification.
This advancement improves explainable object recognition, enhancing trust across diverse applications.
- Score: 4.385998292803586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision Language Models (VLMs), such as CLIP, have significantly contributed to various computer vision tasks, including object recognition and object detection. Their open vocabulary feature enhances their value. However, their black-box nature and lack of explainability in predictions make them less trustworthy in critical domains. Recently, some work has been done to force VLMs to provide reasonable rationales for object recognition, but this often comes at the expense of classification accuracy. In this paper, we first propose a mathematical definition of explainability in the object recognition task based on the joint probability distribution of categories and rationales, then leverage this definition to fine-tune CLIP in an explainable manner. Through evaluations of different datasets, our method demonstrates state-of-the-art performance in explainable classification. Notably, it excels in zero-shot settings, showcasing its adaptability. This advancement improves explainable object recognition, enhancing trust across diverse applications. The code will be made available online upon publication.
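As a rough illustration of the joint category-rationale idea, prompts such as "a photo of a {category} because {rationale}" can be scored with an off-the-shelf CLIP model and normalized into an approximate joint distribution. The sketch below is an illustration only, not ECOR's fine-tuning procedure; the Hugging Face checkpoint, the prompt template, the rationale lists, and the image path are assumptions.

```python
# Minimal zero-shot sketch: score (category, rationale) prompts with CLIP,
# treat the softmax over all pairs as an approximate joint distribution,
# then marginalize over rationales to predict the category.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical category -> rationale lists (not from the paper).
categories = {
    "cat": ["it has whiskers", "it has pointed ears"],
    "dog": ["it has a snout", "it has floppy ears"],
}
pairs = [(c, r) for c, rs in categories.items() for r in rs]
prompts = [f"a photo of a {c} because {r}" for c, r in pairs]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # shape (1, num_pairs)

joint = logits.softmax(dim=-1).squeeze(0)            # ~ p(category, rationale | image)
cat_probs = {c: 0.0 for c in categories}
for (c, _), p in zip(pairs, joint.tolist()):
    cat_probs[c] += p                                # marginalize over rationales
best_cat = max(cat_probs, key=cat_probs.get)
best_pair = pairs[int(joint.argmax())]
print(best_cat, round(cat_probs[best_cat], 3), "rationale:", best_pair[1])
```

The same prompt-scoring structure can be kept after fine-tuning; only the encoder weights change.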
Related papers
- Concept Visualization: Explaining the CLIP Multi-modal Embedding Using WordNet [4.597864989500202]
We propose a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings.
ConVis makes use of lexical information from WordNet to compute task-agnostic Saliency Maps for any concept, not limited to concepts the end model was trained on.
arXiv Detail & Related papers (2024-05-23T13:41:17Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales [54.78115855552886]
We show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture.
With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner.
For robust and interpretable vision tasks at larger scales, hierarchical invariant representations can be considered an effective alternative to traditional CNNs and invariants.
arXiv Detail & Related papers (2024-02-23T16:50:07Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Learning Common Rationale to Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
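As a rough sketch of how a GradCAM-style rationale map is obtained (here from a generic scalar score on a torchvision ResNet, not from the paper's SSL objective or training code):

```python
# Minimal Grad-CAM sketch: the gradient of an objective w.r.t. a feature map
# weights that map into a coarse "rationale" heatmap.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
feats, grads = {}, {}

def fwd_hook(_, __, output):
    feats["act"] = output              # feature map, shape (B, C, H, W)

def bwd_hook(_, __, grad_output):
    grads["grad"] = grad_output[0]     # gradient of the objective w.r.t. that map

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)        # dummy image
score = model(x).max()                 # stand-in for the SSL objective
score.backward()

weights = grads["grad"].mean(dim=(2, 3), keepdim=True)   # per-channel importance
cam = torch.relu((weights * feats["act"]).sum(dim=1))    # (B, H, W) rationale map
cam = cam / (cam.max() + 1e-8)                           # normalize to [0, 1]
print(cam.shape)
```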
arXiv Detail & Related papers (2023-03-03T02:07:40Z)
- Doubly Right Object Recognition: A Why Prompt for Visual Rationales [28.408764714247837]
We investigate whether computer vision models can also provide correct rationales for their predictions.
We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels and the right rationales.
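A minimal sketch of what such a doubly-right metric could look like, with hypothetical field names rather than the benchmark's actual format:

```python
# A prediction only counts as correct when both the label and the rationale match.
def doubly_right_accuracy(predictions, ground_truth):
    """predictions / ground_truth: lists of (label, rationale) tuples."""
    hits = sum(
        int(p_label == g_label and p_rat == g_rat)
        for (p_label, p_rat), (g_label, g_rat) in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

preds = [("cat", "it has whiskers"), ("dog", "it has wings")]
gold = [("cat", "it has whiskers"), ("dog", "it has a snout")]
print(doubly_right_accuracy(preds, gold))  # 0.5: right label, wrong rationale
```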
arXiv Detail & Related papers (2022-12-12T19:25:45Z)
- Object Recognition as Classification of Visual Properties [5.1652563977194434]
We present an object recognition process based on Ranganathan's four-phased faceted knowledge organization process.
We briefly introduce the ongoing project MultiMedia UKC, whose aim is to build an object recognition resource.
arXiv Detail & Related papers (2021-12-20T13:50:07Z)
- Recognition Awareness: An Application of Latent Cognizance to Open-Set Recognition [0.0]
The softmax mechanism forces a model to predict an object class from a set of pre-defined labels.
This characteristic contributes to classification efficacy, but poses a risk of nonsensical predictions in object recognition.
Open-Set Recognition is intended to address the issue of identifying foreign objects in object recognition.
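A minimal sketch of the closed-set limitation and a naive confidence-threshold baseline for rejecting foreign objects (a common baseline, not the paper's Latent Cognizance approach):

```python
import torch

def predict_open_set(logits: torch.Tensor, threshold: float = 0.5):
    """Return the predicted class index, or -1 ("unknown") when the maximum
    softmax probability falls below the threshold."""
    probs = logits.softmax(dim=-1)
    conf, idx = probs.max(dim=-1)
    return torch.where(conf >= threshold, idx, torch.full_like(idx, -1))

logits = torch.tensor([[4.0, 0.5, 0.2],    # confident -> class 0
                       [1.0, 1.1, 0.9]])   # flat distribution -> rejected
print(predict_open_set(logits))            # tensor([ 0, -1])
```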
arXiv Detail & Related papers (2021-08-27T04:41:41Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.