A Simple Baseline for Knowledge-Based Visual Question Answering
- URL: http://arxiv.org/abs/2310.13570v2
- Date: Tue, 24 Oct 2023 13:24:25 GMT
- Title: A Simple Baseline for Knowledge-Based Visual Question Answering
- Authors: Alexandros Xenos, Themos Stafylakis, Ioannis Patras and Georgios
Tzimiropoulos
- Abstract summary: This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA).
Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline.
Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets.
- Score: 78.00758742784532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is on the problem of Knowledge-Based Visual Question Answering
(KB-VQA). Recent works have emphasized the significance of incorporating both
explicit (through external databases) and implicit (through LLMs) knowledge to
answer questions requiring external knowledge effectively. A common limitation
of such approaches is that they consist of relatively complicated pipelines and
often heavily rely on accessing GPT-3 API. Our main contribution in this paper
is to propose a much simpler and readily reproducible pipeline which, in a
nutshell, is based on efficient in-context learning by prompting LLaMA (1 and
2) using question-informative captions as contextual information. Contrary to
recent approaches, our method is training-free, does not require access to
external databases or APIs, and yet achieves state-of-the-art accuracy on the
OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to
understand important aspects of our method. Our code is publicly available at
https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA
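As a rough sketch of the proposed pipeline, the snippet below builds a few-shot prompt from question-informative captions and queries a LLaMA checkpoint through Hugging Face transformers. The prompt template, shot selection, decoding settings, and the example captions are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the in-context KB-VQA pipeline: prompt a LLaMA model with
# question-informative captions as context plus a few in-context QA shots.
# Illustrative only; prompt template and shot selection are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any LLaMA works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def build_prompt(shots, captions, question):
    """Compose a few-shot prompt: each shot pairs a caption with a QA example."""
    parts = []
    for s in shots:  # in-context examples, e.g. drawn from the training split
        parts.append(
            f"Context: {s['caption']}\nQuestion: {s['question']}\nAnswer: {s['answer']}"
        )
    # Test example: question-informative captions serve as the context.
    parts.append(f"Context: {' '.join(captions)}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

def answer(shots, captions, question, max_new_tokens=5):
    prompt = build_prompt(shots, captions, question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens and cut at the first newline.
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return completion.strip().split("\n")[0]

# Toy usage with hypothetical captions for one OK-VQA-style question.
shots = [{"caption": "A man rides a wave on a surfboard.",
          "question": "What sport is the man doing?", "answer": "surfing"}]
captions = ["A red double-decker bus drives down a busy London street."]
print(answer(shots, captions, "In which country would you usually see this bus?"))
```

In the actual method, the captions are generated to be question-informative and the in-context examples are chosen per test question; the toy values above only show the shape of the prompt.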
Related papers
- Contri(e)ve: Context + Retrieve for Scholarly Question Answering [0.0]
We present a two-step solution for the Scholarly-QALD dataset using the open-source Large Language Model (LLM) Llama3.1.
Firstly, we extract the context pertaining to the question from different structured and unstructured data sources.
Secondly, we implement prompt engineering to improve the information retrieval performance of the LLM.
arXiv Detail & Related papers (2024-09-13T17:38:47Z)
- HOLMES: Hyper-Relational Knowledge Graphs for Multi-hop Question Answering using LLMs [9.559336828884808]
Large Language Models (LLMs) are adept at answering simple (single-hop) questions.
As the complexity of the questions increases, the performance of LLMs degrades.
Recent methods try to reduce this burden by integrating structured knowledge triples into the raw text.
We propose to use a knowledge graph (KG) that is context-aware and is distilled to contain query-relevant information.
arXiv Detail & Related papers (2024-06-10T05:22:49Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- An In-Context Schema Understanding Method for Knowledge Base Question Answering [70.87993081445127]
Large Language Models (LLMs) have shown strong capabilities in language understanding and can be used to solve this task.
Existing methods bypass the challenge of schema understanding by initially employing LLMs to generate drafts of logic forms without schema-specific details.
We propose a simple In-Context Schema Understanding (ICSU) method that enables LLMs to directly understand schemas by leveraging in-context learning.
arXiv Detail & Related papers (2023-10-22T04:19:17Z)
- Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases.
We propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity).
Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
- LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection [30.65373229617201]
We propose LaKo, a knowledge-driven VQA method via Late Knowledge-to-text Injection.
To effectively incorporate an external KG, we transfer triples into text and propose a late injection mechanism (a rough triple-to-text sketch follows this list).
In the evaluation on the OK-VQA dataset, our method achieves state-of-the-art results.
arXiv Detail & Related papers (2022-07-26T13:29:51Z)
- Multifaceted Improvements for Conversational Open-Domain Question Answering [54.913313912927045]
We propose a framework with Multifaceted Improvements for Conversational open-domain Question Answering (MICQA).
First, the proposed KL-divergence-based regularization leads to better question understanding for retrieval and answer reading.
Second, the added post-ranker module pushes more relevant passages to the top placements so they can be selected for the reader under two-aspect constraints.
Third, the well-designed curriculum learning strategy effectively narrows the gap between the golden-passage settings of training and inference, and encourages the reader to find the true answer without golden-passage assistance.
arXiv Detail & Related papers (2022-04-01T07:54:27Z)
- MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering [23.628740943735167]
We propose MuKEA, which represents multimodal knowledge as explicit triplets that correlate visual objects and fact answers with implicit relations.
By adopting a pre-training and fine-tuning learning strategy, both basic and domain-specific multimodal knowledge are progressively accumulated for answer prediction.
arXiv Detail & Related papers (2022-03-17T07:42:14Z)
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
In this work we study open-domain knowledge, the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
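As a rough illustration of the triple-to-text idea mentioned in the LaKo entry above, the short sketch below verbalizes KG triples into plain sentences before appending them to the textual VQA input; the relation templates and example triples are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch of verbalizing KG triples into text for late injection,
# in the spirit of LaKo; relation templates here are illustrative only.
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical templates; a real system would cover the KG's relation vocabulary.
TEMPLATES = {
    "is_a": "{s} is a kind of {o}.",
    "used_for": "{s} is used for {o}.",
    "located_in": "{s} is located in {o}.",
}

def verbalize(triples: Iterable[Triple]) -> str:
    """Turn retrieved triples into a plain-text knowledge passage."""
    sentences = []
    for s, r, o in triples:
        template = TEMPLATES.get(r, "{s} {r} {o}.")  # fall back to the raw relation
        sentences.append(template.format(s=s, r=r.replace("_", " "), o=o))
    return " ".join(sentences)

# Example: the resulting text would be appended to the question/caption input.
triples = [("fire hydrant", "used_for", "fighting fires"),
           ("fire hydrant", "located_in", "city streets")]
print(verbalize(triples))
```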
This list is automatically generated from the titles and abstracts of the papers on this site.