Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
- URL: http://arxiv.org/abs/2502.16641v1
- Date: Sun, 23 Feb 2025 16:39:39 GMT
- Title: Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
- Authors: Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou
- Abstract summary: Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. We propose ReAuSE, an alternative to previous RAG models for the knowledge-based VQA task. Our model functions both as a generative retriever and an accurate answer generator.
- Score: 17.803396998387665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to previous RAG models for the knowledge-based VQA task, which seamlessly integrates the knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not only retrieves documents from the knowledge base by producing identifiers for each document, but also answers visual questions based on the retrieved documents. Furthermore, we propose a reinforced retrieval calibration module, driven by relevance feedback, to improve retrieval performance and align with the preferences for accurate answer generation. Extensive experiments on two representative datasets, OKVQA and A-OKVQA, demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.
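The "built-in search engine" is a generative retriever: rather than embedding and ranking, the model decodes a document identifier token by token, constrained so that only identifiers actually present in the knowledge base can be produced. The sketch below illustrates that constrained-decoding idea; the prefix trie is the real mechanism, but the `score` function is a toy stand-in for the multimodal LLM's next-token preferences, and all names are illustrative rather than taken from the paper.

```python
# Sketch of generative retrieval via constrained decoding over document
# identifiers. The trie restricts decoding to identifiers that exist in
# the knowledge base; `score` is a toy stand-in for the multimodal LLM's
# next-token preferences. Illustrative only, not the paper's code.

from collections import defaultdict

def build_trie(identifiers):
    """Map each identifier prefix to the set of valid next tokens."""
    trie = defaultdict(set)
    for ident in identifiers:
        tokens = ident.split()
        for i in range(len(tokens)):
            trie[tuple(tokens[:i])].add(tokens[i])
    return trie

def score(token, query):
    """Toy scorer: count how often the token appears in the query."""
    return query.lower().replace("?", " ").split().count(token)

def constrained_decode(query, identifiers):
    """Greedily decode one identifier, emitting only valid continuations."""
    trie = build_trie(identifiers)
    prefix = ()
    while trie[prefix]:                 # stop when no identifier extends this prefix
        allowed = sorted(trie[prefix])  # the constraint: real continuations only
        best = max(allowed, key=lambda t: score(t, query))
        prefix += (best,)
    return " ".join(prefix)

doc_ids = ["eiffel tower paris", "statue of liberty", "tower bridge london"]
print(constrained_decode("Which city is the Eiffel Tower in?", doc_ids))
# -> "eiffel tower paris"
```

A real system would run beam search over the model's logits masked by the trie, with an end-of-identifier marker so that one identifier may be a prefix of another.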
Related papers
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation [72.70046559930555]
We propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks.
Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes.
In addition, we employ an adaptive, note-based stop-exploration strategy to decide "what to retrieve and when to stop" to encourage sufficient knowledge exploration.
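Read literally, the blurb describes a gather-merge-stop loop. Below is a minimal sketch of one plausible reading, with toy word-overlap helpers standing in for Adaptive-Note's retriever and note updater; it is an assumption for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an adaptive, note-based retrieval loop in the
# spirit of Adaptive-Note: keep retrieving, merge new passages into a
# growing note, and stop once a round adds nothing new.

def retrieve(query, note, corpus):
    """Return passages that share a word with the query and are not yet noted."""
    words = set(query.lower().split())
    return [p for p in corpus
            if words & set(p.lower().split()) and p not in note]

def adaptive_note_gather(query, corpus, max_rounds=3):
    note = []                        # the growing note of gathered knowledge
    for _ in range(max_rounds):
        new_passages = retrieve(query, note, corpus)
        if not new_passages:         # stop-exploration: nothing new was learned
            break
        note.extend(new_passages)    # knowledge-growth step
    return note                      # a real system now answers from the note

corpus = ["Paris is the capital of France.",
          "The Eiffel Tower is in Paris.",
          "Bananas are yellow."]
print(adaptive_note_gather("Where is the Eiffel Tower located?", corpus))
```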
arXiv Detail & Related papers (2024-10-11T14:03:29Z)
- Multimodal Reranking for Knowledge-Intensive Visual Question Answering [77.24401833951096]
We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation.
Experiments on OK-VQA and A-OKVQA show that a multi-modal reranker trained with distant supervision provides consistent improvements.
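For concreteness, here is a toy sketch of the reranking step: candidates come back from a first-stage retriever, and a relevance model rescores each against the question and the image before the answer generator sees them. The word-overlap scorer is an illustrative stand-in for the learned multimodal cross-encoder.

```python
# Toy sketch of a multimodal reranking step: a relevance model rescores
# first-stage knowledge candidates against the question and the image.
# Word overlap stands in for the learned multimodal cross-encoder.

def relevance(question, image_caption, candidate):
    """Stand-in cross-encoder: overlap between query context and candidate."""
    context = set((question + " " + image_caption).lower().split())
    return len(context & set(candidate.lower().split()))

def rerank(question, image_caption, candidates, top_k=2):
    """Keep the top_k highest-scoring candidates for the answer generator."""
    return sorted(candidates,
                  key=lambda c: relevance(question, image_caption, c),
                  reverse=True)[:top_k]

candidates = ["the statue of liberty was a gift from france",
              "liberty island sits in new york harbor",
              "copper turns green as it oxidizes"]
print(rerank("where is this statue from", "the statue of liberty", candidates))
```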
arXiv Detail & Related papers (2024-07-17T02:58:52Z)
- Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems [14.62114319247837]
Retrieval-augmented generation (RAG) techniques leverage the in-context learning capabilities of large language models (LLMs) to produce more accurate and relevant responses.
A critical component, the Query Rewriter module, enhances knowledge retrieval by generating a search-friendly query.
These four RAG modules synergistically improve the response quality and efficiency of the RAG system.
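As a rough illustration of the Query Rewriter idea, the sketch below turns a conversational question into a keyword-style search query before retrieval. The paper uses an LLM for this step, so the hand-written stop-word rules and names here are purely hypothetical.

```python
# Illustrative sketch of a Query Rewriter: turn a conversational question
# into a keyword-style search query before retrieval. The rules below are
# toy heuristics; the real module is an LLM.

STOP_WORDS = {"the", "a", "an", "is", "was", "what", "who", "where",
              "when", "how", "did", "do", "does", "of", "in", "on", "it"}

def rewrite_query(question, history_entities=()):
    """Drop function words and splice in entities the pronoun leaves implicit."""
    keywords = [w.strip("?.,!").lower() for w in question.split()]
    keywords = [w for w in keywords if w not in STOP_WORDS]
    return " ".join(list(history_entities) + keywords)

# "it" refers to an entity from an earlier turn, tracked by the dialog state
print(rewrite_query("When was it completed?", history_entities=["Eiffel Tower"]))
# -> "Eiffel Tower completed"
```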
arXiv Detail & Related papers (2024-07-15T12:35:00Z)
- Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering [11.183845003492964]
We use Dense Passage Retrieval (DPR) to retrieve related knowledge that helps the model answer questions.
However, DPR conducts retrieval in natural-language space, which may not ensure comprehensive acquisition of image information.
We propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions.
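A minimal sketch of this two-step pipeline follows: bag-of-words count vectors stand in for DPR's learned dense encoders, and a caption-overlap heuristic stands in for the visual-language model's knowledge-selection step. Both are assumptions for illustration.

```python
# Minimal sketch of DPR-style dense retrieval followed by a selection step.
# Toy bag-of-words vectors replace the learned encoders; the caption
# heuristic replaces the visual-language model. Illustrative only.

from collections import Counter

def embed(text):
    """Toy encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def dot(u, v):
    return sum(u[w] * v[w] for w in u)

def dpr_retrieve(question, passages, k=2):
    """Rank passages by inner product with the question vector (DPR-style)."""
    q = embed(question)
    return sorted(passages, key=lambda p: dot(q, embed(p)), reverse=True)[:k]

def select_key_knowledge(image_caption, candidates):
    """Stand-in for the VLM: prefer the candidate that also matches the image."""
    cap = embed(image_caption)
    return max(candidates, key=lambda c: dot(cap, embed(c)))

passages = ["the colosseum is an amphitheatre in rome",
            "rome is the capital of italy",
            "gladiators fought in arenas"]
top = dpr_retrieve("what city is this amphitheatre in", passages)
print(select_key_knowledge("the colosseum in rome", top))
```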
arXiv Detail & Related papers (2024-04-22T07:44:20Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
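The end-to-end claim is that one model scores knowledge against the text and the image jointly, rather than captioning first and retrieving second. Below is a toy sketch of that fused-query retrieval; the bag-of-words "encoder" and the symbolic image tags are stand-ins assumed for illustration, not ReViz's learned joint encoder.

```python
# Sketch of retrieval from a joint text+image query: one encoder scores
# knowledge against both modalities at once. The fused bag-of-words
# "encoder" is a toy stand-in for a learned joint encoder.

from collections import Counter

def encode_multimodal(text_query, image_tags):
    """Toy joint encoder: fuse both modalities into one query vector."""
    fused = Counter(text_query.lower().split())
    fused.update(t.lower() for t in image_tags)   # image signal, same space
    return fused

def retrieve(text_query, image_tags, corpus, k=1):
    q = encode_multimodal(text_query, image_tags)
    score = lambda doc: sum(q[w] for w in doc.lower().split())
    return sorted(corpus, key=score, reverse=True)[:k]

corpus = ["the great wall of china crosses northern china",
          "the berlin wall fell in 1989"]
# `image_tags` are symbolic stand-ins for what a vision encoder contributes
print(retrieve("when was this wall built", ["great", "wall", "china"], corpus))
```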
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Multi-Grained Knowledge Retrieval for End-to-End Task-Oriented Dialog [42.088274728084265]
Retrieving proper domain knowledge from an external database lies at the heart of end-to-end task-oriented dialog systems.
Most existing systems blend knowledge retrieval with response generation and optimize them with direct supervision from reference responses.
We propose to decouple knowledge retrieval from response generation and introduce a multi-grained knowledge retriever.
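The abstract does not spell out the granularities, but one plausible reading is a coarse-to-fine retriever: first pick the relevant knowledge-base entity, then the attribute the user actually asked about. The toy KB and heuristics below are assumptions for illustration, not the paper's retriever.

```python
# One plausible reading of "multi-grained" retrieval for task-oriented
# dialog, kept deliberately toy: a coarse pass picks the relevant KB
# entity, a fine pass picks the attribute asked about.

kb = {
    "pizza_palace": {"area": "north", "food": "pizza", "phone": "555-0101"},
    "sushi_spot":   {"area": "south", "food": "sushi", "phone": "555-0202"},
}

def retrieve_entity(utterance, kb):
    """Coarse grain: score entities by how many of their values are mentioned."""
    words = set(utterance.lower().split())
    return max(kb, key=lambda e: len(words & set(kb[e].values())))

def retrieve_attribute(utterance, entity, kb):
    """Fine grain: pick the attribute the user is asking about."""
    for attr, value in kb[entity].items():
        if attr in utterance.lower():
            return attr, value
    return None

entity = retrieve_entity("give me the phone of the sushi place in the south", kb)
print(entity, retrieve_attribute("what is their phone number", entity, kb))
# -> sushi_spot ('phone', '555-0202')
```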
arXiv Detail & Related papers (2023-05-17T12:12:46Z)
- Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation [33.56304858796142]
Multi-modal multi-hop question answering involves answering a question by reasoning over multiple input sources from different modalities.
Existing methods often retrieve evidence separately and then use a language model to generate an answer based on the retrieved evidence.
We propose a Structured Knowledge and Unified Retrieval-Generation (SKURG) approach to address these issues.
arXiv Detail & Related papers (2022-12-16T18:12:04Z)
- UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph [89.98762327725112]
Multi-hop Question Answering over Knowledge Graph (KGQA) aims to find the answer entities that are multiple hops away from the topic entities mentioned in a natural language question.
We propose UniKGQA, a novel approach for the multi-hop KGQA task, unifying retrieval and reasoning in both model architecture and parameter learning.
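A toy sketch of the unifying idea: a single question-relation matching score decides both which edges to expand (retrieval) and how relevance propagates hop by hop from the topic entity (reasoning). The overlap-based `relation_score` is a stand-in for UniKGQA's learned matching module; the graph and names are illustrative.

```python
# Toy sketch of unified retrieval-and-reasoning over a knowledge graph:
# one question-relation score drives edge expansion and hop-wise score
# propagation. Illustrative only.

graph = [  # (head, relation, tail) triples
    ("paris", "capital_of", "france"),
    ("france", "currency", "euro"),
    ("paris", "located_in", "europe"),
]

def relation_score(question, relation):
    """Stand-in for learned question-relation matching."""
    return len(set(question.lower().split()) & set(relation.split("_")))

def multi_hop_answer(question, topic_entity, hops=2):
    frontier = {topic_entity: 1.0}           # start at the topic entity
    for _ in range(hops):
        next_frontier = {}
        for head, rel, tail in graph:
            if head in frontier:             # expand one hop outward
                s = frontier[head] * (1 + relation_score(question, rel))
                next_frontier[tail] = max(next_frontier.get(tail, 0.0), s)
        frontier = next_frontier or frontier
    return max(frontier, key=frontier.get)   # best-scored entity is the answer

print(multi_hop_answer("what currency is used in the country whose capital is paris",
                       "paris"))
# -> "euro"
```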
arXiv Detail & Related papers (2022-12-02T04:08:09Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead): it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
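The pattern is simple enough to sketch directly. In the sketch below, `llm` is a hypothetical text-completion callable, not a specific API; a deterministic toy stand-in is included so the example runs end to end.

```python
# Sketch of the generate-then-read pattern: prompt an LLM to write context
# documents, then answer by reading them. `llm` is a hypothetical callable.

def generate_then_read(question, llm, n_docs=3):
    # Step 1: generate contextual documents instead of retrieving them
    docs = [llm(f"Write a background passage that answers: {question}")
            for _ in range(n_docs)]
    # Step 2: read the generated documents to produce the final answer
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

def toy_llm(prompt):
    """Deterministic stand-in for a real language model."""
    if prompt.startswith("Write a background"):
        return "The Eiffel Tower stands in Paris, France."
    return "Paris"

print(generate_then_read("Where is the Eiffel Tower?", toy_llm))
# -> "Paris"
```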
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Open-Retrieval Conversational Question Answering [62.11228261293487]
We introduce an open-retrieval conversational question answering (ORConvQA) setting, where we learn to retrieve evidence from a large collection before extracting answers.
We build an end-to-end system for ORConvQA, featuring a retriever, a reranker, and a reader that are all based on Transformers.
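A minimal sketch of that three-stage pipeline follows, with the conversation history folded into the retrieval query. Word overlap stands in for all three Transformer components, and the names and heuristics are illustrative only.

```python
# Minimal sketch of a retriever -> reranker -> reader pipeline over a
# collection, with dialog history folded into the query. Toy stand-ins
# replace the Transformer retriever, reranker, and reader.

def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

def orconvqa_pipeline(history, question, collection):
    query = " ".join(history + [question])   # conversational context matters
    # Stage 1: the retriever narrows the large collection to rough candidates
    candidates = sorted(collection, key=lambda p: overlap(query, p),
                        reverse=True)[:3]
    # Stage 2: the reranker rescores the short list (a cross-encoder in the
    # real system; reusing overlap here keeps the sketch self-contained)
    candidates.sort(key=lambda p: overlap(query, p), reverse=True)
    # Stage 3: the reader would extract an answer span; this toy version
    # returns the top passage as the evidence the answer is read from
    return candidates[0]

collection = ["the louvre is a museum in paris",
              "paris hosted the 1900 olympics",
              "the mona lisa hangs in the louvre"]
print(orconvqa_pipeline(["tell me about the louvre"],
                        "which city is it in", collection))
```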
arXiv Detail & Related papers (2020-05-22T19:39:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.