Learnable Visual Words for Interpretable Image Recognition
- URL: http://arxiv.org/abs/2205.10724v2
- Date: Thu, 26 May 2022 14:43:21 GMT
- Title: Learnable Visual Words for Interpretable Image Recognition
- Authors: Wenxiao Xiao, Zhengming Ding, Hongfu Liu
- Abstract summary: We propose Learnable Visual Words (LVW) to interpret model prediction behaviors with two novel modules.
The semantic visual words learning module relaxes the category-specific constraint, enabling general visual words shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of the proposed LVW in both accuracy and model interpretation.
- Score: 70.85686267987744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To interpret deep models' predictions, attention-based visual cues are widely
used to address *why* deep models make such predictions. Beyond that, the research
community has become increasingly interested in reasoning about *how* deep models make
predictions, where some prototype-based methods employ interpretable representations
with their corresponding visual cues to reveal the black-box mechanism of deep model
behaviors. However, these pioneering attempts either learn category-specific prototypes,
which deteriorates their generalization capacity, or present only a few illustrative
examples without a quantitative evaluation of visual interpretability, limiting their
practical usage. In this paper, we revisit the concept of visual words and propose
Learnable Visual Words (LVW) to interpret model prediction behaviors with two novel
modules: semantic visual words learning and dual fidelity preservation. The semantic
visual words learning module relaxes the category-specific constraint, enabling general
visual words to be shared across different categories. Beyond employing the visual words
for prediction to align them with the base model, our dual fidelity preservation module
also includes an attention-guided semantic alignment that encourages the learned visual
words to focus on the same conceptual regions as the base model when making predictions.
Experiments on six visual benchmarks demonstrate that the proposed LVW outperforms
state-of-the-art methods in both accuracy and model interpretation. Moreover, we provide
various in-depth analyses to further explore the learned visual words and the
generalizability of our method to unseen categories.
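To make the two modules above more concrete, below is a minimal PyTorch-style sketch of one way the abstract's description could be realized. All names and design choices here (the `LearnableVisualWords` module, the cosine-similarity word scoring, and the `dual_fidelity_loss` combining a KL term for prediction fidelity with an MSE term for attention alignment) are illustrative assumptions inferred from the abstract, not the authors' released implementation.

```python
# Illustrative sketch only: a shared dictionary of visual words and a
# dual-fidelity loss, loosely following the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableVisualWords(nn.Module):
    """A dictionary of visual words shared across all categories (no
    category-specific constraint), with a linear classifier on top of the
    per-image word activations."""

    def __init__(self, feat_dim: int, num_words: int, num_classes: int):
        super().__init__()
        self.words = nn.Parameter(torch.randn(num_words, feat_dim))
        self.classifier = nn.Linear(num_words, num_classes)

    def forward(self, feat_map: torch.Tensor):
        # feat_map: (B, C, H, W) features from a (frozen) base-model backbone.
        B, C, H, W = feat_map.shape
        patches = feat_map.flatten(2).transpose(1, 2)        # (B, H*W, C)
        # Cosine similarity between every spatial patch and every visual word.
        sim = F.normalize(patches, dim=-1) @ F.normalize(self.words, dim=-1).t()
        word_act = sim.max(dim=1).values                     # (B, K) strongest response per word
        attn = sim.max(dim=2).values.view(B, H, W)           # (B, H, W) word attention map
        return self.classifier(word_act), attn


def dual_fidelity_loss(lvw_logits, base_logits, lvw_attn, base_attn):
    # (1) Prediction fidelity: the visual-word classifier should agree with
    #     the base model's predictions.
    pred_align = F.kl_div(F.log_softmax(lvw_logits, dim=1),
                          F.softmax(base_logits, dim=1),
                          reduction="batchmean")
    # (2) Attention-guided semantic alignment: visual words should focus on
    #     the same conceptual regions the base model attends to.
    attn_align = F.mse_loss(lvw_attn, base_attn)
    return pred_align + attn_align
```

Under this reading, the shared word dictionary realizes the relaxed category-specific constraint, while the two loss terms correspond to the prediction-level and attention-level fidelity described in the abstract.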
Related papers
- Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - What Do Deep Saliency Models Learn about Visual Attention? [28.023464783469738]
We present a novel analytic framework that sheds light on the implicit features learned by saliency models.
Our approach decomposes these implicit features into interpretable bases that are explicitly aligned with semantic attributes.
arXiv Detail & Related papers (2023-10-14T23:15:57Z) - Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z) - Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions [2.7195102129095003]
We introduce a framework for testing category-level perceptual grounding in multi-modal language models.
We train separate neural networks to generate and interpret descriptions of visual categories.
We show that communicative success exposes performance issues in the generation model.
arXiv Detail & Related papers (2023-03-07T17:01:25Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Towards explainable evaluation of language models on the semantic similarity of visual concepts [0.0]
We examine the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies.
First, we address the need for explainable evaluation metrics, necessary for understanding the conceptual quality of retrieved instances.
Secondly, adversarial interventions on salient query semantics expose vulnerabilities of opaque metrics and highlight patterns in learned linguistic representations.
arXiv Detail & Related papers (2022-09-08T11:40:57Z) - VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z) - Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build a visual-semantics-aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts: first, we apply the constructed hierarchy to the object recognition task; then we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z) - Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity to attend to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)