Improving and Diagnosing Knowledge-Based Visual Question Answering via
Entity Enhanced Knowledge Injection
- URL: http://arxiv.org/abs/2112.06888v1
- Date: Mon, 13 Dec 2021 18:45:42 GMT
- Title: Improving and Diagnosing Knowledge-Based Visual Question Answering via
Entity Enhanced Knowledge Injection
- Authors: Diego Garcia-Olano, Yasumasa Onoe, Joydeep Ghosh
- Abstract summary: Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question about an associated image.
Recent single-modality text work has shown that knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
- Score: 14.678153928301493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task
requiring external world knowledge in order to correctly answer a text question
about an associated image. Recent single-modality text work has shown that
knowledge injection into pre-trained language models, specifically
entity-enhanced knowledge graph embeddings, can improve performance on
downstream entity-centric tasks. In this work, we empirically study how and
whether such methods, applied in a bi-modal setting, can improve an existing
VQA system's performance on the KBVQA task. We experiment with two large,
publicly available VQA datasets: (1) KVQA, which contains mostly rare Wikipedia
entities, and (2) OKVQA, which is less entity-centric and more aligned with
common-sense reasoning. Both lack explicit entity spans, and we study the
effect of different weakly supervised and manual methods for obtaining them.
Additionally, we analyze how recently proposed bi-modal and single-modal
attention explanations are affected by the incorporation of such
entity-enhanced representations. Our results show substantially improved
performance on the KBVQA task without the need for additional costly
pre-training, and we provide insights into when entity knowledge injection
helps improve a model's understanding. We provide code and enhanced datasets
for reproducibility.
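To make the described approach more concrete, the following is a minimal PyTorch sketch of entity-enhanced knowledge injection for a bi-modal VQA model: pre-trained entity (knowledge graph) embeddings are projected into the encoder's hidden space and added to the question token embeddings at their entity spans before the enriched sequence reaches the VQA encoder. The class name, dimensions, and additive fusion below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EntityKnowledgeInjector(nn.Module):
    """Illustrative sketch: fuse pre-trained entity (KG) embeddings into
    question token embeddings at entity spans before a bi-modal VQA encoder.
    Dimensions and the additive fusion strategy are assumptions."""

    def __init__(self, hidden_dim: int = 768, entity_dim: int = 300):
        super().__init__()
        # Project KG entity embeddings into the encoder's hidden space.
        self.entity_proj = nn.Linear(entity_dim, hidden_dim)
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_embeds, entity_embeds, entity_spans):
        """
        token_embeds:  (batch, seq_len, hidden_dim) question token embeddings
        entity_embeds: list of (num_entities_i, entity_dim) tensors per example
        entity_spans:  list of [(start, end), ...] per example, aligned with
                       entity_embeds; spans come from weak supervision or
                       manual annotation
        """
        fused = token_embeds.clone()
        for i, (spans, ents) in enumerate(zip(entity_spans, entity_embeds)):
            projected = self.layer_norm(self.entity_proj(ents))
            for (start, end), ent_vec in zip(spans, projected):
                # Add the projected entity vector to every token in its span.
                fused[i, start:end] = fused[i, start:end] + ent_vec
        return fused

# Usage sketch: inject entity knowledge into question tokens, then pass the
# enriched embeddings (alongside image region features) to the VQA encoder.
injector = EntityKnowledgeInjector()
tokens = torch.randn(1, 16, 768)      # question token embeddings
entities = [torch.randn(2, 300)]      # two linked Wikipedia entity embeddings
spans = [[(3, 5), (9, 12)]]           # weakly supervised entity spans
enriched = injector(tokens, entities, spans)
```

Entity spans could come from the weakly supervised or manual methods the abstract compares; the enriched token embeddings would then be paired with image region features in the downstream bi-modal model.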
Related papers
- FusionMind -- Improving question and answering with external context fusion [0.0]
We studied the impact of contextual knowledge on the question-answering (QA) objective using pre-trained language models (LMs) and knowledge graphs (KGs).
We found that incorporating knowledge facts context led to a significant improvement in performance.
This suggests that integrating contextual knowledge facts may be particularly impactful for enhancing question answering performance.
arXiv Detail & Related papers (2023-12-31T03:51:31Z)
- Utilizing Background Knowledge for Robust Reasoning over Traffic Situations [63.45021731775964]
We focus on a complementary research aspect of Intelligent Transportation: traffic understanding.
We scope our study to text-based methods and datasets, given the abundance of commonsense knowledge available in text.
We adopt three knowledge-driven approaches for zero-shot QA over traffic situations.
arXiv Detail & Related papers (2022-12-04T09:17:24Z)
- Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering [27.38981906033932]
Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge and then predicts the answer.
Retrievals are frequently too general and fail to cover the specific knowledge needed to answer the question.
We propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge.
arXiv Detail & Related papers (2022-10-18T21:39:24Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose REVIVE, a new knowledge-based VQA method that tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
- VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [79.22069768972207]
We propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations.
Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context.
On two challenging VQA tasks, our method outperforms strong baseline VQA methods by 3.2% on VCR and 4.6% on GQA, suggesting its strength in performing concept-level reasoning.
arXiv Detail & Related papers (2022-05-23T17:55:34Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than human beings do on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work, we study open-domain knowledge: the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models. Second, symbolic knowledge encoded in knowledge bases.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
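As a rough illustration of the implicit-plus-symbolic combination described in the KRISP entry above, the sketch below adds answer scores from an implicit, transformer-derived joint representation to scores restricted to answers reachable in a symbolic knowledge graph. The module names, shared answer vocabulary, and additive late fusion are assumptions made for this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ImplicitSymbolicAnswerer(nn.Module):
    """Illustrative sketch: combine answer scores from an implicit
    (transformer-based) representation with scores over answers supported
    by a symbolic knowledge graph. Fusion by simple addition over a shared
    answer vocabulary is an assumption for this sketch."""

    def __init__(self, hidden_dim: int, num_answers: int):
        super().__init__()
        self.implicit_head = nn.Linear(hidden_dim, num_answers)
        self.symbolic_head = nn.Linear(hidden_dim, num_answers)

    def forward(self, fused_qv_repr, kb_answer_mask):
        """
        fused_qv_repr:  (batch, hidden_dim) joint question-image representation
        kb_answer_mask: (batch, num_answers) 1 where the answer is connected to
                        entities retrieved from the knowledge graph, else 0
        """
        implicit_scores = self.implicit_head(fused_qv_repr)
        # The symbolic branch only scores answers the knowledge graph supports.
        symbolic_scores = self.symbolic_head(fused_qv_repr) * kb_answer_mask
        return implicit_scores + symbolic_scores

# Usage sketch with toy shapes.
model = ImplicitSymbolicAnswerer(hidden_dim=768, num_answers=2000)
joint_repr = torch.randn(4, 768)
kb_mask = torch.randint(0, 2, (4, 2000)).float()
scores = model(joint_repr, kb_mask)   # (4, 2000) combined answer logits
```

Masking the symbolic branch to knowledge-graph-supported answers is one simple way to keep the symbolic signal complementary to what the implicit model already captures.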