UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training
- URL: http://arxiv.org/abs/2504.04065v1
- Date: Sat, 05 Apr 2025 05:42:12 GMT
- Title: UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training
- Authors: Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu
- Abstract summary: We propose a Unified Retrieval-Augmented VQA framework (UniRVQA) for knowledge-intensive visual questions. UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
- Score: 16.14877145354785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions requiring external knowledge, such as web-sourced encyclopedia articles. Existing methods often use sequential and separate frameworks for the retriever and the generator with limited parametric knowledge sharing. However, since both retrieval and generation tasks require accurate understanding of contextual and external information, such separation can potentially lead to suboptimal system performance. Another key challenge is the integration of multimodal information. General-purpose multimodal pre-trained models, while adept at multimodal representation learning, struggle with fine-grained retrieval required for knowledge-intensive visual questions. Recent specialized pre-trained models mitigate the issue, but are computationally expensive. To bridge the gap, we propose a Unified Retrieval-Augmented VQA framework (UniRVQA). UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework, enabling cross-task parametric knowledge sharing and the extension of existing multimodal representation learning capability. We further introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Additionally, we integrate late interaction into the retrieval-augmented generation joint training process to enhance fine-grained understanding of queries and documents. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
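To make the "late interaction" component mentioned in the abstract more concrete, the sketch below shows a ColBERT-style MaxSim relevance score combined with a generation loss, which is the standard way late interaction is paired with retrieval-augmented generation in joint training. This is only an illustrative assumption of what such a component could look like in PyTorch; the function names (`late_interaction_score`, `joint_loss`), the loss weighting, and the input shapes are hypothetical and are not taken from the UniRVQA paper.

```python
# Illustrative sketch only: ColBERT-style late interaction (MaxSim) scoring
# plus a schematic joint retrieval + generation loss. Names are hypothetical.
import torch
import torch.nn.functional as F


def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: each query token is matched to its most
    similar document token, and the per-token maxima are summed.

    query_emb: (num_query_tokens, dim) multimodal query token embeddings
    doc_emb:   (num_doc_tokens, dim)   document token embeddings
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=-1).values.sum()


def joint_loss(query_emb, doc_embs, positive_idx, answer_log_probs, alpha=1.0):
    """Schematic joint objective: a contrastive retrieval loss over candidate
    documents plus the answer-generation negative log-likelihood.

    doc_embs: list of (num_doc_tokens, dim) tensors, one per candidate document
    positive_idx: index of the gold (relevant) document
    answer_log_probs: (answer_len,) token log-probabilities from the generator
    """
    scores = torch.stack([late_interaction_score(query_emb, d) for d in doc_embs])
    retrieval_loss = F.cross_entropy(scores.unsqueeze(0),
                                     torch.tensor([positive_idx]))
    generation_loss = -answer_log_probs.mean()
    return retrieval_loss + alpha * generation_loss
```

In this reading, sharing the same query encoder between the retrieval score and the generator is what enables the cross-task parametric knowledge sharing the abstract describes; the exact formulation used by UniRVQA may differ.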
Related papers
- Open-Ended and Knowledge-Intensive Video Question Answering [20.256081440725353]
We investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation. Our analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models. We achieve a substantial 17.5% improvement in accuracy on multiple choice questions in the KnowIT VQA dataset.
arXiv Detail & Related papers (2025-02-17T12:40:35Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Multimodal Reranking for Knowledge-Intensive Visual Question Answering [77.24401833951096]
We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation.
Experiments on OK-VQA and A-OKVQA show that multi-modal reranker from distant supervision provides consistent improvements.
arXiv Detail & Related papers (2024-07-17T02:58:52Z) - RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z) - Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z) - Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering [47.668572102657684]
This work introduces a novel, explainable multi-agent collaboration framework that leverages the expansive knowledge of Large Language Models (LLMs) to enhance the capabilities of Vision Language Models (VLMs).
arXiv Detail & Related papers (2023-11-29T03:10:42Z) - Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual
Question Answering [16.52970318866536]
This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions.
A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query.
We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks.
arXiv Detail & Related papers (2023-06-28T18:06:40Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - Retrieval Augmented Visual Question Answering with Outside Knowledge [14.371342370460685]
Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images.
Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation.
We propose a joint training scheme that integrates differentiable DPR with answer generation so that the system can be trained in an end-to-end fashion.
arXiv Detail & Related papers (2022-10-07T20:35:58Z) - Correlation Information Bottleneck: Towards Adapting Pretrained
Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z) - KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.
An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.