Related papers: Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

URL: http://arxiv.org/abs/2404.13947v2
Date: Sun, 16 Jun 2024 07:04:48 GMT
Title: Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Authors: Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu
Abstract summary: We use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. DPR conducts retrieval in natural language space, which may not ensure comprehensive acquisition of image information. We propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions.
Score: 11.183845003492964
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large pre-trained visual-language models have shown promising results on traditional visual question answering benchmarks, it is still challenging for them to answer complex VQA problems that require diverse world knowledge. Motivated by research on retrieval-augmented generation in natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. However, DPR conducts retrieval in natural language space, which may not ensure comprehensive acquisition of image information. Thus, the retrieved knowledge is not truly conducive to answering the question, which degrades the performance of the overall system. To address this issue, we propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions. The framework consists of two modules, a Selector and an Answerer, both initialized from the MLLM and parameter-efficiently finetuned by self-bootstrapping: (1) find key knowledge in the retrieved knowledge documents using the Selector, and use it to finetune the Answerer to predict answers; (2) obtain pseudo-labels of key knowledge documents based on the predictions of the Answerer and weak supervision labels, and finetune the Selector to select key knowledge; (3) repeat. Our framework significantly enhances the performance of the baseline on the challenging open-domain Knowledge-based VQA benchmark, OK-VQA, achieving a state-of-the-art accuracy of 62.83%.
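
The alternating Selector/Answerer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: ToySelector, ToyAnswerer, and self_bootstrap are hypothetical names, both "models" are trivial stand-ins for the finetuned MLLM modules, and the weak supervision label is assumed here to mean "the document contains the ground-truth answer string".

```python
from dataclasses import dataclass, field


@dataclass
class ToySelector:
    """Hypothetical stand-in for the Selector: remembers pseudo-labeled key docs."""
    key_pairs: set = field(default_factory=set)

    def select(self, question, docs, k=1):
        # Docs pseudo-labeled as key sort first; ties keep the retrieval order
        # because Python's sort is stable.
        ranked = sorted(docs, key=lambda d: (question, d) not in self.key_pairs)
        return ranked[:k]

    def finetune(self, pseudo_labels):
        # "Finetuning" here just stores the (question, doc) pseudo-labels.
        self.key_pairs |= pseudo_labels


@dataclass
class ToyAnswerer:
    """Hypothetical stand-in for the Answerer: answers with the doc's last word."""
    seen: int = 0

    def predict(self, question, doc):
        return doc.split()[-1]

    def finetune(self, examples):
        # "Finetuning" here only counts the training examples it was given.
        self.seen += len(examples)


def self_bootstrap(samples, selector, answerer, rounds=2, k=1):
    """samples: list of (question, retrieved_docs, answer) triples."""
    for _ in range(rounds):
        # Step 1: the Selector picks key docs, which are used to finetune the Answerer.
        answerer.finetune([(q, selector.select(q, docs, k), a) for q, docs, a in samples])
        # Step 2: pseudo-label a doc as key when the Answerer's prediction from it
        # is correct AND the assumed weak label (doc contains the answer) holds.
        pseudo = {
            (q, d)
            for q, docs, a in samples
            for d in docs
            if answerer.predict(q, d) == a and a in d
        }
        selector.finetune(pseudo)
        # Step 3: repeat.
    return selector, answerer


samples = [("what sport?", ["a bat is wood", "this game is baseball"], "baseball")]
selector, answerer = self_bootstrap(samples, ToySelector(), ToyAnswerer())
print(selector.select("what sport?", samples[0][1]))
```

In this toy setting an untrained Selector simply returns the top retrieved document, while after bootstrapping it prefers the document that led the Answerer to the correct answer.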

Related papers

Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines [17.803396998387665]
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task. Our model functions both as a generative retriever and an accurate answer generator.
arXiv Detail & Related papers (2025-02-23T16:39:39Z)
Multimodal Reranking for Knowledge-Intensive Visual Question Answering [77.24401833951096]
We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation. Experiments on OK-VQA and A-OKVQA show that a multi-modal reranker from distant supervision provides consistent improvements.
arXiv Detail & Related papers (2024-07-17T02:58:52Z)
Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with capabilities of answering more general questions. Specifically, a well-defined detector is adopted to predict image-question related relation phrases. The optimal answer is predicted by choosing the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models [19.85526116658481]
We introduce ChatKBQA, a novel and simple generate-then-retrieve KBQA framework. Experimental results show that ChatKBQA achieves new state-of-the-art performance on standard KBQA datasets. This work can also be regarded as a new paradigm for combining LLMs with knowledge graphs for interpretable and knowledge-required question answering.
arXiv Detail & Related papers (2023-10-13T09:45:14Z)
Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. We propose a new retriever-ranker paradigm of KB-VQA, Graph pATH rankER (GATHER for brevity). Specifically, it contains graph construction, pruning, and path-level ranking, which not only retrieve accurate answers but also provide inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering [30.858737348472626]
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. We present a conceptually simple, flexible, and general framework designed to prompt LLM with answers for knowledge-based VQA.
arXiv Detail & Related papers (2023-03-03T13:05:15Z)
REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA). We propose a new knowledge-based VQA method, REVIVE, which tries to utilize the explicit information of object regions. We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time. We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation. We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
Improving Commonsense Question Answering by Graph-based Iterative Retrieval over Multiple Knowledge Sources [26.256653692882715]
How to engage commonsense effectively in question answering systems is still under exploration. We propose a novel question-answering method by integrating ConceptNet, Wikipedia, and the Cambridge Dictionary. We use a pre-trained language model to encode the question, retrieved knowledge and choices, and propose an answer choice-aware attention mechanism.
arXiv Detail & Related papers (2020-11-05T08:50:43Z)
Knowledgeable Dialogue Reading Comprehension on Key Turns [84.1784903043884]
Multi-choice machine reading comprehension (MRC) requires models to choose the correct answer from candidate options given a passage and a question. Our research focuses on dialogue-based MRC, where the passages are multi-turn dialogues. It suffers from two challenges: the answer selection decision is made without the support of latently helpful commonsense, and the multi-turn context may hide considerable irrelevant information.
arXiv Detail & Related papers (2020-04-29T07:04:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.