REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
Question Answering
- URL: http://arxiv.org/abs/2206.01201v1
- Date: Thu, 2 Jun 2022 17:59:56 GMT
- Title: REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
Question Answering
- Authors: Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, Lu
Yuan
- Abstract summary: This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method, REVIVE, which exploits the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
- Score: 75.53187719777812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper revisits visual representation in knowledge-based visual question
answering (VQA) and demonstrates that using regional information in a better
way can significantly improve the performance. While visual representation is
extensively studied in traditional VQA, it is under-explored in knowledge-based
VQA, even though the two tasks share the same spirit, i.e., both rely on visual
input to answer the question. Specifically, we observe that in most
state-of-the-art knowledge-based VQA methods: 1) visual features are extracted
either from the whole image or in a sliding window manner for retrieving
knowledge, and the important relationship within/among object regions is
neglected; 2) visual features are not well utilized in the final answering
model, which is counter-intuitive to some extent. Based on these observations,
we propose a new knowledge-based VQA method REVIVE, which tries to utilize the
explicit information of object regions not only in the knowledge retrieval
stage but also in the answering model. The key motivation is that object
regions and inherent relationships are important for knowledge-based VQA. We
perform extensive experiments on the standard OK-VQA dataset and achieve new
state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous
state-of-the-art method by a large margin (+3.6%). We also conduct a detailed
analysis and show the necessity of regional information in different framework
components for knowledge-based VQA.
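The abstract's central claim is a two-stage design in which explicit object-region features drive both the knowledge-retrieval step and the final answering model. The sketch below is a minimal, hypothetical illustration of that data flow; the region detector, knowledge base, feature fusion, and answer model are all random or placeholder stand-ins, not the REVIVE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Stand-in inputs: features of 5 detected object regions and a question embedding.
region_feats = rng.normal(size=(5, DIM))
question_emb = rng.normal(size=(DIM,))

# Toy external knowledge base: an embedding and a text snippet per entry.
kb_embs = rng.normal(size=(100, DIM))
kb_texts = [f"knowledge passage {i}" for i in range(100)]

def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

# Stage 1: knowledge retrieval driven by a region-aware query rather than a
# whole-image feature. The fusion here (mean region feature plus question
# embedding) is a placeholder, not the paper's fusion.
query = normalize((region_feats.mean(axis=0) + question_emb)[None, :])
scores = (normalize(kb_embs) @ query.T).ravel()
top_k = scores.argsort()[::-1][:3]
retrieved = [kb_texts[i] for i in top_k]

# Stage 2: the answering model also receives the explicit region features,
# not only the question and the retrieved knowledge.
answer_model_inputs = {
    "question": question_emb,
    "regions": region_feats,
    "retrieved_knowledge": [kb_embs[i] for i in top_k],
}
print(retrieved)
```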
Related papers
- Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering [11.183845003492964]
We use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions.
DPR conducts retrieval in natural language space, which may not capture the image information comprehensively.
We propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions.
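As a rough picture of the retrieve-then-select pipeline this summary describes, the following sketch first retrieves passages by question-passage similarity (DPR-style, text only) and then re-scores the candidates with an image-aware score standing in for the visual-language selector; all embeddings are random placeholders, not the paper's model or the DPR library.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8

# Stand-in embeddings: a question, an image, and 50 knowledge passages.
question_emb = rng.normal(size=(DIM,))
image_emb = rng.normal(size=(DIM,))
passage_embs = rng.normal(size=(50, DIM))

def cos(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Step 1: DPR-style text-only retrieval (question vs. passage similarity).
text_scores = np.array([cos(question_emb, p) for p in passage_embs])
candidates = text_scores.argsort()[::-1][:10]

# Step 2: a placeholder "selector" re-scores the candidates with the image
# taken into account, standing in for the visual-language knowledge selection.
select_scores = np.array([cos(image_emb, passage_embs[i]) for i in candidates])
selected = [int(candidates[i]) for i in select_scores.argsort()[::-1][:3]]
print(selected)
```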
arXiv Detail & Related papers (2024-04-22T07:44:20Z)
- Knowledge Condensation and Reasoning for Knowledge-based VQA [20.808840633377343]
Recent studies retrieve the knowledge passages from external knowledge bases and then use them to answer questions.
We propose two synergistic models: Knowledge Condensation model and Knowledge Reasoning model.
Our method achieves state-of-the-art performance on knowledge-based VQA datasets.
arXiv Detail & Related papers (2024-03-15T06:06:06Z)
- Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering [27.38981906033932]
Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge and then predicts the answer.
Retrievals are frequently too general and fail to cover specific knowledge needed to answer the question.
We propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge.
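A toy version of entity-focused ranking is sketched below: passages that mention assumed question-relevant entities are ranked above generic matches. The entity list and the keyword matching are illustrative placeholders for EnFoRe's learned entity recognition and supervision.

```python
# Toy entity-focused scoring: passages mentioning question-relevant entities
# outrank generic matches. Entity extraction here is plain keyword matching,
# standing in for a learned entity recognizer.
question = "what breed is the dog on the skateboard"
entities = {"dog", "skateboard"}          # assumed question-relevant entities

passages = [
    "skateboards were popularized in california in the 1950s",
    "the border collie is a breed of dog often seen on a skateboard in videos",
    "dogs are domesticated mammals",
]

def entity_score(passage: str) -> int:
    # Count how many question-relevant entities the passage mentions.
    return sum(entity in passage for entity in entities)

ranked = sorted(passages, key=entity_score, reverse=True)
print(ranked[0])
```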
arXiv Detail & Related papers (2022-10-18T21:39:24Z)
- Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering [18.926582410644375]
Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions.
We propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR).
Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets.
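The sketch below shows a generic multi-step key-value memory read (attention over keys, weighted sum of values, query refinement), which is the mechanism family this summary refers to; it is not the DMMGR architecture, and all vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, SLOTS, STEPS = 8, 6, 2

# A toy key-value knowledge memory: keys index facts, values carry their content.
keys = rng.normal(size=(SLOTS, DIM))
values = rng.normal(size=(SLOTS, DIM))
query = rng.normal(size=(DIM,))  # e.g., a question/graph-node representation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Multi-step read: attend over the keys, pull in the weighted values, and use
# the result to refine the query for the next reasoning step.
for step in range(STEPS):
    attn = softmax(keys @ query)          # attention weights over memory slots
    read = attn @ values                  # weighted sum of value vectors
    query = query + read                  # refine the query with retrieved content
print(query.round(2))
```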
arXiv Detail & Related papers (2022-03-06T15:19:39Z)
- Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection [14.678153928301493]
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image.
Recent text-only work has shown that knowledge injection into pre-trained language models, specifically entity-enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
arXiv Detail & Related papers (2021-12-13T18:45:42Z)
- Achieving Human Parity on Visual Question Answering [67.22500027651509]
The Visual Question Answering (VQA) task utilizes both visual and language analysis to answer a textual question with respect to an image.
This paper describes our recent research on AliceMind-MMU, which obtains similar or even slightly better results than humans on VQA.
This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task.
arXiv Detail & Related papers (2021-11-17T04:25:11Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
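Because the technique is described as model-agnostic, a minimal way to picture such injection is appending embedded KB facts as extra input tokens that any vision-and-language transformer can attend to; the sketch below shows only that concatenation, with random placeholder embeddings and no claim about the paper's actual integration.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 12

# Stand-in embedded inputs for a vision-and-language transformer.
question_tokens = rng.normal(size=(7, DIM))    # 7 word-piece embeddings
region_tokens = rng.normal(size=(5, DIM))      # 5 object-region embeddings

# Stand-in retrieved KB facts, already embedded into the same space.
kb_fact_tokens = rng.normal(size=(3, DIM))

# Model-agnostic injection: append the KB facts as extra tokens so the
# transformer can attend to them alongside the usual text and region inputs.
input_sequence = np.concatenate([question_tokens, region_tokens, kb_fact_tokens], axis=0)
print(input_sequence.shape)   # (15, DIM): 7 text + 5 region + 3 knowledge tokens
```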
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models. Second, explicit symbolic knowledge encoded in knowledge bases.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
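A toy analogue of this controlled generation is sketched below: one hop in a scene graph plus one hop in an external knowledge base, stitched together by a question template. The graph, KB entry, and template are invented placeholders, not the dataset's actual generation programs.

```python
# Toy controlled generation of a knowledge-routed QA pair: one scene-graph hop
# followed by one external-knowledge hop. All contents are invented examples.
scene_graph = {"objects": ["banana", "table"],
               "relations": [("banana", "on", "table")]}
knowledge_base = {"banana": ("is_rich_in", "potassium")}

def generate_qa(graph, kb):
    subject, relation, target = graph["relations"][0]   # scene-graph hop
    _, answer = kb[subject]                              # external-knowledge hop
    question = f"What is the {subject} {relation} the {target} rich in?"
    return question, answer

print(generate_qa(scene_graph, knowledge_base))
# ('What is the banana on the table rich in?', 'potassium')
```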
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
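For contrast with the region-based representations discussed above, the sketch below shows what "grid features" amount to in the simplest reading: flattening a backbone feature map into a set of spatial feature vectors, with no detector or region annotations; the feature map here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in convolutional feature map from an image backbone: (channels, H, W).
feature_map = rng.normal(size=(256, 7, 7))

# Grid features: flatten the spatial locations into 7*7 = 49 feature vectors,
# with no object detector or region annotations involved.
C, H, W = feature_map.shape
grid_features = feature_map.reshape(C, H * W).T      # (49, 256)

print(grid_features.shape)
```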
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.