KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual
Commonsense Reasoning
- URL: http://arxiv.org/abs/2012.07000v1
- Date: Sun, 13 Dec 2020 08:22:33 GMT
- Title: KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual
Commonsense Reasoning
- Authors: Dandan Song, Siyi Ma, Zhanchen Sun, Sicheng Yang, Lejian Liao
- Abstract summary: In the visual commonsense reasoning (VCR) task, a machine must answer a challenging question about an image correctly and then provide a rationale justifying its answer.
We propose a novel Knowledge Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model.
Besides taking visual and linguistic contents as input, external commonsense knowledge extracted from ConceptNet is integrated into the multi-layer Transformer.
- Score: 4.787501955202053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning is a critical ability towards complete visual understanding. To
develop machines with cognition-level visual understanding and reasoning abilities,
the visual commonsense reasoning (VCR) task has been introduced. In VCR, given a
challenging question about an image, a machine must answer correctly and then
provide a rationale justifying its answer. Methods adopting the powerful BERT model
as the backbone for learning joint representations of image content and natural
language have shown promising improvements on VCR. However, none of the existing
methods has utilized commonsense knowledge in visual commonsense reasoning, which
we believe would be greatly helpful in this task. With the support of commonsense
knowledge, complex questions can be answered through cognitive reasoning even when
the required information is not depicted in the image. We therefore incorporate
commonsense knowledge into the cross-modal BERT and propose a novel Knowledge
Enhanced Visual-and-Linguistic BERT (KVL-BERT for short) model. Besides taking
visual and linguistic contents as input, it integrates external commonsense
knowledge extracted from ConceptNet into the multi-layer Transformer. To preserve
the structural information and semantic representation of the original sentence, we
propose using relative position embedding and mask-self-attention to weaken the
influence between the injected commonsense knowledge and the unrelated components
of the input sequence. Our KVL-BERT outperforms other task-specific models and
general task-agnostic pre-training models by a large margin.
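The abstract describes two mechanisms for injecting ConceptNet knowledge without disturbing the original sentence: relative position embedding and mask-self-attention. The snippet below is a minimal, illustrative sketch (not the authors' code) of the mask-self-attention idea, assuming a boolean visibility matrix that lets injected knowledge tokens interact only with the token they were retrieved for; the function name, toy dimensions, and example indices are assumptions, and relative position embeddings are omitted.

```python
# Minimal sketch of mask-self-attention with a visibility matrix (assumed setup,
# not the KVL-BERT implementation).
import torch
import torch.nn.functional as F

def mask_self_attention(q, k, v, visibility):
    """q, k, v: (seq_len, dim); visibility: (seq_len, seq_len) bool,
    visibility[i, j] = True if token i is allowed to attend to token j."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5               # raw attention scores
    scores = scores.masked_fill(~visibility, float("-inf"))   # hide unrelated pairs
    weights = F.softmax(scores, dim=-1)                       # normalized attention
    return weights @ v

# Toy example: 4 original tokens plus 2 injected knowledge tokens (indices 4, 5)
# retrieved for token 2. Original tokens see each other; only the anchor token
# and the knowledge tokens themselves can see the injected tokens.
seq_len, dim = 6, 8
x = torch.randn(seq_len, dim)
visibility = torch.zeros(seq_len, seq_len, dtype=torch.bool)
visibility[:4, :4] = True          # original sentence tokens attend to each other
visibility[2, 4:] = True           # anchor token 2 can see its knowledge tokens
visibility[4:, 2] = True           # knowledge tokens can see their anchor
visibility[4:, 4:] = True          # and each other
out = mask_self_attention(x, x, x, visibility)
print(out.shape)                   # torch.Size([6, 8])
```

In this sketch, unrelated tokens never receive attention weight from the injected knowledge tokens, which is the stated purpose of weakening their mutual influence.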
Related papers
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, followed by manual review for quality assurance.
arXiv Detail & Related papers (2024-05-15T21:55:31Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge [48.457788853408616]
We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues.
We show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases.
arXiv Detail & Related papers (2022-10-24T22:01:17Z)
- Attention Mechanism based Cognition-level Scene Understanding [23.592893555879538]
In Visual Commonsense Reasoning (VCR), a model must predict an answer together with the corresponding rationale, which requires the ability to make inferences about the real world.
Previous approaches to the VCR task generally rely on pre-training or on memory-based models that encode long dependency relationships.
We propose PAVCR, a parallel attention-based cognitive VCR network, which fuses visual-textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference.
arXiv Detail & Related papers (2022-04-17T15:04:44Z)
- KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.
An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
arXiv Detail & Related papers (2021-12-16T04:37:10Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge: the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. The first is implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
- Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers [54.417299589288184]
We investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus.
Our adapter-based models substantially outperform BERT on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS.
arXiv Detail & Related papers (2020-05-24T15:49:57Z)
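The adapter-injection entry above complements BERT's distributional knowledge with conceptual knowledge from ConceptNet and OMCS through small adapter modules trained while the pretrained weights stay fixed. Below is a minimal, hypothetical sketch of such a bottleneck adapter; the hidden size, bottleneck width, and placement inside a layer are assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a bottleneck adapter for knowledge injection (assumed setup,
# not the paper's implementation).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the pretrained representation intact;
        # only the small adapter weights are updated during knowledge injection.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
x = torch.randn(2, 16, 768)           # (batch, seq_len, hidden)
print(adapter(x).shape)               # torch.Size([2, 16, 768])
```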