Related papers: Multimodal Reranking for Knowledge-Intensive Visual Question Answering

URL: http://arxiv.org/abs/2407.12277v1
Date: Wed, 17 Jul 2024 02:58:52 GMT
Title: Multimodal Reranking for Knowledge-Intensive Visual Question Answering
Authors: Haoyang Wen, Honglei Zhuang, Hamed Zamani, Alexander Hauptmann, Michael Bendersky
Abstract summary: We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation. Experiments on OK-VQA and A-OKVQA show that a multi-modal reranker trained with distant supervision provides consistent improvements.
Score: 77.24401833951096
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Knowledge-intensive visual question answering requires models to effectively use external knowledge to help answer visual questions. A typical pipeline includes a knowledge retriever and an answer generator. However, a retriever that utilizes local information, such as an image patch, may not provide reliable question-candidate relevance scores. In addition, the two-tower architecture limits how well a retriever can model relevance scores when selecting top candidates for the answer generator to reason over. In this paper, we introduce an additional module, a multi-modal reranker, to improve the ranking quality of knowledge candidates for answer generation. Our reranking module takes multi-modal information from both candidates and questions and performs cross-item interaction for better relevance score modeling. Experiments on OK-VQA and A-OKVQA show that a multi-modal reranker trained with distant supervision provides consistent improvements. We also find a training-testing discrepancy with reranking in answer generation, where performance improves if training knowledge candidates are similar to or noisier than those used in testing.
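Illustration: the retrieve-then-rerank pipeline described in the abstract can be sketched roughly as follows. This is a minimal toy sketch, not the authors' implementation. The function names, the example vectors, and the lookup-table scorer are all hypothetical; in the paper the reranker is a multi-modal model that scores each question-candidate pair jointly, which is replaced here by a fixed score table.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Sequence[float],
             cand_vecs: List[Sequence[float]],
             k: int) -> List[int]:
    """Two-tower style retrieval (hypothetical helper).

    Query and candidates are encoded independently, so relevance is
    just a dot product between precomputed vectors.
    """
    scores = [dot(query_vec, c) for c in cand_vecs]
    order = sorted(range(len(cand_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def rerank(candidate_ids: List[int],
           cross_score: Callable[[int], float],
           n: int) -> List[int]:
    """Re-score the retrieved candidates with a joint scorer and keep the top n.

    `cross_score` stands in for a model that reads the question and the
    candidate together; here it is an assumed callable, not a real model.
    """
    return sorted(candidate_ids, key=cross_score, reverse=True)[:n]


# Toy data (assumed values, for illustration only).
query_vec = [1.0, 0.0]
cand_vecs = [[0.9, 0.1], [0.8, 0.9], [0.1, 1.0], [0.7, 0.0]]

# Stage 1: the retriever keeps the 3 candidates with the highest dot product.
retrieved = retrieve(query_vec, cand_vecs, k=3)

# Stage 2: a stand-in joint scorer reorders them before answer generation.
toy_cross_scores = {0: 0.2, 1: 0.9, 3: 0.5}
reranked = rerank(retrieved, lambda i: toy_cross_scores[i], n=2)

print(retrieved, reranked)
```

The point of the two stages is that the cheap independent scoring narrows the candidate pool, while the more expensive joint scoring only runs on the few candidates that survive.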

Related papers

UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training [16.14877145354785]
We propose a Unified Retrieval-Augmented VQA framework (UniRVQA) for knowledge-intensive visual questions. UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
arXiv Detail & Related papers (2025-04-05T05:42:12Z)
Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines [17.803396998387665]
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task. Our model functions both as a generative retriever and an accurate answer generator.
arXiv Detail & Related papers (2025-02-23T16:39:39Z)
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering [11.183845003492964]
Previous work uses Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions. DPR conducts retrieval in natural language space, which may not ensure comprehensive acquisition of image information. We propose a novel framework that leverages the visual-language model to select the key knowledge retrieved by DPR and answer questions.
arXiv Detail & Related papers (2024-04-22T07:44:20Z)
Open-Set Knowledge-Based Visual Question Answering with Inference Paths [79.55742631375063]
The purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. We propose a new retriever-ranker paradigm of KB-VQA, Graph pATH rankER (GATHER for brevity). Specifically, it contains graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process.
arXiv Detail & Related papers (2023-10-12T09:12:50Z)
PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems [59.1250765143521]
Current knowledge-grounded dialogue systems often fail to align the generated responses with human-preferred qualities. We propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework. We demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history.
arXiv Detail & Related papers (2023-09-19T08:27:09Z)
End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model "ReViz" that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion. We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
Multi-Grained Knowledge Retrieval for End-to-End Task-Oriented Dialog [42.088274728084265]
Retrieving proper domain knowledge from an external database lies at the heart of end-to-end task-oriented dialog systems. Most existing systems blend knowledge retrieval with response generation and optimize them with direct supervision from reference responses. We propose to decouple knowledge retrieval from response generation and introduce a multi-grained knowledge retriever.
arXiv Detail & Related papers (2023-05-17T12:12:46Z)
Retrieval Augmented Visual Question Answering with Outside Knowledge [14.371342370460685]
Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images. Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation. We propose a joint training scheme which includes differentiable DPR integrated with answer generation so that the system can be trained in an end-to-end fashion.
arXiv Detail & Related papers (2022-10-07T20:35:58Z)
Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices [0.0]
The research directions of the QA field are analyzed based on the type of question, answer type, source of evidence-answer, and modeling approach. This detailed analysis is followed by open challenges of the field, such as automatic question generation, similarity detection, and low resource availability for a language.
arXiv Detail & Related papers (2021-12-07T08:53:40Z)
Retrieve, Program, Repeat: Complex Knowledge Base Question Answering via Alternate Meta-learning [56.771557756836906]
We present a novel method that automatically learns a retrieval model alternately with the programmer from weak supervision. Our system leads to state-of-the-art performance on a large-scale task for complex question answering over knowledge bases.
arXiv Detail & Related papers (2020-10-29T18:28:16Z)
MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement learning based multi-step ranking model, named MS-Ranker. We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism. Our model significantly outperforms existing methods that do not rely on external resources.
arXiv Detail & Related papers (2020-10-10T10:36:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.