Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision
Transformers for High-Level Image Classification
- URL: http://arxiv.org/abs/2402.19339v1
- Date: Thu, 29 Feb 2024 16:46:48 GMT
- Title: Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision
Transformers for High-Level Image Classification
- Authors: Delfina Sol Martinez Pandiani, Nicolas Lazzari, Valentina Presutti
- Abstract summary: We leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification.
The resulting ARTstract Knowledge Graph (AKG) captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs.
We demonstrate the synergy and complementarity between the situated perceptual knowledge of KG embeddings and the sensory-perceptual understanding of deep visual models for AC image classification.
- Score: 0.1843404256219181
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The increasing demand for automatic high-level image understanding,
particularly in detecting abstract concepts (AC) within images, underscores the
necessity for innovative and more interpretable approaches. These approaches
need to harmonize traditional deep vision methods with the nuanced,
context-dependent knowledge humans employ to interpret images at intricate
semantic levels. In this work, we leverage situated perceptual knowledge of
cultural images to enhance performance and interpretability in AC image
classification. We automatically extract perceptual semantic units from images,
which we then model and integrate into the ARTstract Knowledge Graph (AKG).
This resource captures situated perceptual semantics gleaned from over 14,000
cultural images labeled with ACs. Additionally, we enhance the AKG with
high-level linguistic frames. We compute KG embeddings and experiment with
relative representations and hybrid approaches that fuse these embeddings with
vision transformer (ViT) embeddings. Finally, for interpretability, we conduct
post-hoc qualitative analyses by examining model similarities with training
instances. Our results show that our hybrid KGE-ViT methods outperform existing
techniques in AC image classification. The post-hoc interpretability analyses
reveal the vision transformer's proficiency in capturing pixel-level visual
attributes, contrasting with our method's efficacy in representing more
abstract and semantic scene elements. We demonstrate the synergy and
complementarity between the situated perceptual knowledge of KG embeddings and
the sensory-perceptual understanding of deep visual models for AC image
classification.
This work suggests the strong potential of neuro-symbolic methods for knowledge
integration and robust image representation in intricate downstream visual
comprehension tasks. All materials and code are available online.
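The fusion step described in the abstract (KG embeddings combined with vision transformer embeddings via relative representations) can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the names (`relative_representation`, `HybridKgeVitClassifier`), the anchor-based construction, and the layer sizes are hypothetical, and the KGE and ViT vectors are assumed to be precomputed elsewhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relative_representation(z: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Re-express each embedding as its cosine similarities to a shared set of
    anchor instances, so that vectors from different spaces (ViT vs. KGE)
    become dimensionally comparable before fusion."""
    z = F.normalize(z, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    return z @ anchors.T                      # shape: (batch, num_anchors)


class HybridKgeVitClassifier(nn.Module):
    """Late fusion: concatenate the two relative representations and classify."""

    def __init__(self, num_anchors: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * num_anchors, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, vit_emb, kge_emb, vit_anchors, kge_anchors):
        rel_vit = relative_representation(vit_emb, vit_anchors)
        rel_kge = relative_representation(kge_emb, kge_anchors)
        fused = torch.cat([rel_vit, rel_kge], dim=-1)
        return self.head(fused)               # logits over abstract-concept labels
```

In such a setup the anchors would typically be a fixed subset of training instances, with the logits trained under a standard cross-entropy loss; the paper's exact fusion strategy may differ.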
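The post-hoc interpretability analysis, which examines which training instances a model deems most similar to a given input, can likewise be sketched as a simple nearest-neighbour lookup over embeddings. The helper below is illustrative only, not the paper's actual tooling; comparing the neighbours retrieved by ViT embeddings versus KGE embeddings is one way to probe what each representation captures.

```python
import torch
import torch.nn.functional as F


def most_similar_training_instances(query_emb: torch.Tensor,
                                    train_embs: torch.Tensor,
                                    k: int = 5) -> torch.Tensor:
    """Return indices of the k training images whose embeddings are most
    cosine-similar to the query image's embedding."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), train_embs, dim=-1)
    return torch.topk(sims, k=k).indices
```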
Related papers
- Trustworthy Image Semantic Communication with GenAI: Explainablity, Controllability, and Efficiency [59.15544887307901]
Image semantic communication (ISC) has garnered significant attention for its potential to achieve high efficiency in visual content transmission.
Existing ISC systems based on joint source-channel coding face challenges in interpretability, operability, and compatibility.
We propose a novel trustworthy ISC framework that employs Generative Artificial Intelligence (GenAI) for multiple downstream inference tasks.
arXiv Detail & Related papers (2024-08-07T14:32:36Z) - Knowledge Fused Recognition: Fusing Hierarchical Knowledge for Image Recognition through Quantitative Relativity Modeling and Deep Metric Learning [18.534970504136254]
We propose a novel deep metric learning based method to fuse hierarchical prior knowledge about image classes.
Existing deep metric learning methods for image classification mainly exploit qualitative relativity between image classes.
A new triplet loss function term that exploits quantitative relativity and aligns distances in model latent space with those in knowledge space is also proposed and incorporated in the proposed dual-modality fusion method.
arXiv Detail & Related papers (2024-07-30T07:24:33Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - CEIR: Concept-based Explainable Image Representation Learning [0.4198865250277024]
We introduce Concept-based Explainable Image Representation (CEIR) to derive high-quality representations without label dependency.
Our method exhibits state-of-the-art unsupervised clustering performance on benchmarks such as CIFAR10, CIFAR100, and STL10.
CEIR can seamlessly extract related concepts from open-world images without fine-tuning.
arXiv Detail & Related papers (2023-12-17T15:37:41Z) - Analyzing Vision Transformers for Image Classification in Class
Embedding Space [5.210197476419621]
This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks.
Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class space.
We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information.
arXiv Detail & Related papers (2023-10-29T10:25:23Z) - Seeing the Intangible: Survey of Image Classification into High-Level
and Abstract Categories [0.20718016474717196]
The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks.
This paper systematically reviews research on high-level visual understanding, focusing on Abstract Concepts (ACs) in automatic image classification.
arXiv Detail & Related papers (2023-08-21T08:37:04Z) - StyleEDL: Style-Guided High-order Attention Network for Image Emotion
Distribution Learning [69.06749934902464]
We propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL.
StyleEDL interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents.
In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations.
arXiv Detail & Related papers (2023-08-06T03:22:46Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - ExCon: Explanation-driven Supervised Contrastive Learning for Image
Classification [12.109442912963969]
We propose to leverage saliency-based explanation methods to create content-preserving masked augmentations for contrastive learning.
Our novel explanation-driven supervised contrastive learning (ExCon) methodology critically serves the dual goals of encouraging nearby image embeddings to have similar content and explanation.
We demonstrate that ExCon outperforms vanilla supervised contrastive learning in terms of classification, explanation quality, adversarial robustness, and calibration of the model's probabilistic predictions under distributional shift.
arXiv Detail & Related papers (2021-11-28T23:15:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.