Interpretable Visual Understanding with Cognitive Attention Network
- URL: http://arxiv.org/abs/2108.02924v3
- Date: Thu, 7 Dec 2023 23:09:57 GMT
- Title: Interpretable Visual Understanding with Cognitive Attention Network
- Authors: Xuejiao Tang, Wenbin Zhang, Yi Yu, Kea Turner, Tyler Derr, Mengyu Wang
and Eirini Ntoutsi
- Abstract summary: We propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning.
We first introduce an image-text fusion module to fuse information from images and text collectively.
Second, a novel inference module is designed to encode commonsense among the image, query, and response.
- Score: 20.991018495051623
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: While image understanding at the recognition level has achieved remarkable
advances, reliable visual scene understanding requires comprehensive image
understanding not only at the recognition level but also at the cognition level, which calls for
exploiting multi-source information as well as learning different levels of
understanding and extensive commonsense knowledge. In this paper, we propose a
novel Cognitive Attention Network (CAN) for visual commonsense reasoning to
achieve interpretable visual understanding. Specifically, we first introduce an
image-text fusion module to fuse information from images and text collectively.
Second, a novel inference module is designed to encode commonsense among the image,
query, and response. Extensive experiments on the large-scale Visual Commonsense
Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our
approach. The implementation is publicly available at
https://github.com/tanjatang/CAN
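As a rough illustration of the two modules described in the abstract, the following PyTorch sketch shows one plausible way to wire an image-text fusion step (cross-attention from text tokens to image region features) to an inference step that scores candidate responses. All module names, dimensions, and layer choices are assumptions made for illustration only and are not taken from the released implementation linked above.

```python
# Hypothetical sketch (not the authors' released code): one way an
# image-text fusion module and an inference module could be composed.
import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Fuses text token embeddings with image region features via cross-attention."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Text tokens (query/response) attend to image regions; residual + norm.
        fused, _ = self.cross_attn(text_emb, image_emb, image_emb)
        return self.norm(text_emb + fused)

class InferenceModule(nn.Module):
    """Jointly encodes the fused context with a candidate response and scores it."""
    def __init__(self, dim: int = 768, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, fused_ctx: torch.Tensor, response_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate along the sequence dimension, encode jointly, mean-pool, score.
        joint = torch.cat([fused_ctx, response_emb], dim=1)
        pooled = self.encoder(joint).mean(dim=1)
        return self.score(pooled)  # one logit per (image, query, response) triple
```

In a VCR-style setup, the same fused query/image context would be paired with each candidate response, and the candidate with the highest logit would be selected.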
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Multi-modal Visual Understanding with Prompts for Semantic Information Disentanglement of Image [0.0]
Multi-modal visual understanding of images with prompts involves using various visual and textual cues to enhance the semantic understanding of images.
By utilizing prompt-based techniques, models can learn to focus on certain features of an image to extract useful information for downstream tasks.
arXiv Detail & Related papers (2023-05-16T10:15:44Z)
- Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge with explicitly syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- The Curious Layperson: Fine-Grained Image Recognition without Expert Labels [90.88501867321573]
We consider a new problem: fine-grained image recognition without expert annotations.
We learn a model to describe the visual appearance of objects using non-expert image descriptions.
We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis.
arXiv Detail & Related papers (2021-11-05T17:58:37Z)
- KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning [4.787501955202053]
In visual commonsense reasoning (VCR) task, a machine must answer correctly and then provide a rationale justifying its answer.
We propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model.
Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer.
arXiv Detail & Related papers (2020-12-13T08:22:33Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)