Fine-grained Late-interaction Multi-modal Retrieval for Retrieval
Augmented Visual Question Answering
- URL: http://arxiv.org/abs/2309.17133v2
- Date: Sat, 28 Oct 2023 16:03:35 GMT
- Title: Fine-grained Late-interaction Multi-modal Retrieval for Retrieval
Augmented Visual Question Answering
- Authors: Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, Bill Byrne
- Abstract summary: Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
- Score: 56.96857992123026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to
utilize knowledge from external knowledge bases to answer visually-grounded
questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong
framework to tackle KB-VQA, first retrieves related documents with Dense
Passage Retrieval (DPR) and then uses them to answer questions. This paper
proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which
significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major
limitations in RA-VQA's retriever: (1) the image representations obtained via
image-to-text transforms can be incomplete and inaccurate and (2) relevance
scores between queries and documents are computed with one-dimensional
embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes
these limitations by obtaining image representations that complement those from
the image-to-text transforms using a vision model aligned with an existing
text-based retriever through a simple alignment network. FLMR also encodes
images and questions using multi-dimensional embeddings to capture
finer-grained relevance between queries and documents. FLMR significantly
improves the original RA-VQA retriever's PRRecall@5 by approximately 8%.
Finally, we equip RA-VQA with two state-of-the-art large
multi-modal/language models to achieve an approximately 61% VQA score on the
OK-VQA dataset.
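The two components described above, an alignment network that maps vision features into the text retriever's token-embedding space and a late-interaction (token-level) relevance score in place of a single-vector dot product, can be sketched briefly. The PyTorch snippet below is a minimal illustration only: the module names, dimensions, two-layer MLP, and ColBERT-style MaxSim operator are assumptions for exposition, not the released FLMR implementation.

# Minimal sketch (assumptions noted above): vision features are projected into
# the retriever's token space, concatenated with text-token embeddings, and
# scored against document tokens with a ColBERT-style MaxSim late interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentNetwork(nn.Module):
    """Hypothetical two-layer MLP projecting vision features into the text retriever's embedding space."""
    def __init__(self, vision_dim=768, text_dim=128, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, vision_feats):
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.mlp(vision_feats)

def late_interaction_score(query_embs, doc_embs):
    # query_embs: (batch, q_len, dim); doc_embs: (batch, d_len, dim)
    # Each query token takes its best-matching document token (MaxSim),
    # and the per-token maxima are summed into one relevance score.
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    sim = torch.einsum("bqe,bde->bqd", q, d)   # token-level cosine similarities
    return sim.max(dim=-1).values.sum(dim=-1)  # (batch,) relevance scores

# Usage sketch with random tensors standing in for encoder outputs.
text_tokens = torch.randn(2, 32, 128)             # ColBERT-style text-token embeddings
vision_feats = torch.randn(2, 49, 768)            # e.g. ViT patch features
aligned = AlignmentNetwork()(vision_feats)        # projected into the text space
query = torch.cat([text_tokens, aligned], dim=1)  # multi-modal query tokens
docs = torch.randn(2, 180, 128)                   # candidate document tokens
print(late_interaction_score(query, docs))        # finer-grained relevance scores

By contrast, a DPR-style retriever pools each side into a single vector and scores with one dot product, which is the coarser, one-dimensional relevance signal the abstract refers to.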
Related papers
- Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering [44.54319663913782]
We propose Retrieval-Augmented MLLM with Compressed Contexts (RACC).
RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA.
It significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2.
arXiv Detail & Related papers (2024-09-11T15:11:39Z)
- Multimodal Reranking for Knowledge-Intensive Visual Question Answering [77.24401833951096]
We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation.
Experiments on OK-VQA and A-OKVQA show that the multi-modal reranker trained with distant supervision provides consistent improvements.
arXiv Detail & Related papers (2024-07-17T02:58:52Z)
- Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with the capability to answer more general questions.
Specifically, a well-defined detector is adopted to predict relation phrases related to the image-question pair.
The optimal answer is predicted by choosing the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
- VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering [68.47402250389685]
This work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR.
The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods.
Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
arXiv Detail & Related papers (2023-12-19T15:56:08Z)
- Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering [7.3532068640624395]
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context.
We propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our relevance encoder for free-form fluent answer generation.
Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA.
arXiv Detail & Related papers (2023-06-29T06:22:43Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering [16.52970318866536]
Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image.
This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader.
arXiv Detail & Related papers (2023-04-26T16:14:39Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
A training approach is also proposed that uses pseudo UQs obtained by randomly pairing images and questions.
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing [1.491109220586182]
VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information.
Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning.
We propose a multi-modal transformer-based architecture to overcome this issue.
arXiv Detail & Related papers (2022-10-10T09:20:33Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.