Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
- URL: http://arxiv.org/abs/2504.04065v2
- Date: Tue, 01 Jul 2025 02:37:16 GMT
- Title: Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
- Authors: Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu
- Abstract summary: Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. We propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy.
- Score: 16.14877145354785
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. The knowledge retrieval and answer generation tasks both necessitate precise multimodal understanding of question context and external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing and ultimately leads to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy and an average 7.5% boost in base MLLMs' VQA performance.
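To make the late interaction mechanism mentioned in the abstract concrete, the sketch below shows a ColBERT-style MaxSim scorer. It is a minimal illustration under the assumption that questions and documents are encoded into per-token embeddings by some multimodal encoder; the names and shapes are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of ColBERT-style late interaction (MaxSim) scoring.
# Assumes per-token, L2-normalized embeddings from a multimodal encoder;
# shapes and names are illustrative assumptions, not taken from the paper.
import torch

def late_interaction_score(query_tokens: torch.Tensor,
                           doc_tokens: torch.Tensor) -> torch.Tensor:
    """Sum, over query tokens, of each token's best cosine similarity to the document.

    query_tokens: (num_query_tokens, dim), L2-normalized
    doc_tokens:   (num_doc_tokens, dim),   L2-normalized
    """
    sim = query_tokens @ doc_tokens.T       # pairwise cosine similarities, shape (Q, D)
    return sim.max(dim=-1).values.sum()     # MaxSim per query token, summed over the query
```

Because scoring stays at token granularity rather than collapsing each side into a single vector, late interaction supports the fine-grained question-document matching the abstract refers to.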
Related papers
- CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG [53.950029990391066]
CoRe-MMRAG (Cross-source knowledge Reconciliation for Multimodal RAG) is a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods.
arXiv Detail & Related papers (2025-06-03T07:32:40Z) - Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica pairs a planner, which decomposes questions into an acyclic graph of subquestions, with a worker that resolves them via retrieval and reasoning. MyGO is a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance.
arXiv Detail & Related papers (2025-05-20T18:33:03Z) - Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning [49.07442840323135]
We propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt. Our proposed Q-Adapt yields a lightweight visual quality evaluator that demonstrates comparable performance.
arXiv Detail & Related papers (2025-04-02T12:02:57Z) - Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach. This study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments. Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that seamlessly integrates fine-grained retrieval with MLLMs.
arXiv Detail & Related papers (2025-02-28T11:25:38Z) - Open-Ended and Knowledge-Intensive Video Question Answering [20.256081440725353]
We investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation. Our analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision-language models. We achieve a substantial 17.5% improvement in accuracy on multiple-choice questions in the KnowIT VQA dataset.
arXiv Detail & Related papers (2025-02-17T12:40:35Z) - Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning [0.0]
We propose Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models.
arXiv Detail & Related papers (2025-01-15T05:45:04Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Multimodal Reranking for Knowledge-Intensive Visual Question Answering [77.24401833951096]
We introduce a multi-modal reranker to improve the ranking quality of knowledge candidates for answer generation.
Experiments on OK-VQA and A-OKVQA show that multi-modal reranker from distant supervision provides consistent improvements.
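For orientation, a reranker of the kind described above can be sketched as a scoring-and-sorting step over retrieved knowledge candidates. In the sketch, `relevance_model` is a hypothetical stand-in for a multimodal cross-encoder over (image, question, candidate) triples; it does not reflect the paper's actual architecture or training signal.

```python
# Hedged sketch of reranking retrieved knowledge candidates before answer generation.
# `relevance_model` is a placeholder for a multimodal cross-encoder that scores
# (image_features, question, candidate) triples; it is an assumption, not the paper's model.
from typing import Callable, List, Tuple

def rerank(image_features: object,
           question: str,
           candidates: List[str],
           relevance_model: Callable[[object, str, str], float],
           top_k: int = 5) -> List[Tuple[str, float]]:
    scored = [(c, relevance_model(image_features, question, c)) for c in candidates]
    # Keep only the highest-scoring candidates for the downstream answer generator.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```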
arXiv Detail & Related papers (2024-07-17T02:58:52Z) - RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored.
This paper introduces RAVEN, a retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z) - Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z) - Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with the capability to answer more general questions.
Specifically, a well-defined detector is adopted to predict relation phrases related to the image-question pair.
The optimal answer is predicted by choosing the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z) - Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering [47.668572102657684]
This work introduces a novel, explainable multi-agent collaboration framework by leveraging the expansive knowledge of Large Language Models (LLMs) to enhance the capabilities of Vision Language Models (VLMs).
arXiv Detail & Related papers (2023-11-29T03:10:42Z) - Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering [16.52970318866536]
This paper studies a category of visual question answering tasks, in which accessing external knowledge is necessary for answering the questions.
A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query.
We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks.
arXiv Detail & Related papers (2023-06-28T18:06:40Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model ReViz that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
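A bare-bones view of such end-to-end multimodal retrieval is a single encoder mapping the joint (text, image) query into the same space as the knowledge corpus, followed by nearest-neighbour lookup. The function below is an illustrative assumption of that final lookup step, not ReViz itself; how the joint query embedding is produced is left abstract.

```python
# Illustrative dense retrieval over a precomputed knowledge index using a joint
# text+image query embedding; the encoder producing `query_emb` is assumed, not ReViz.
import torch

def retrieve_top_k(query_emb: torch.Tensor,    # (dim,) joint text+image query embedding
                   corpus_emb: torch.Tensor,   # (num_docs, dim) precomputed knowledge embeddings
                   k: int = 10) -> torch.Tensor:
    scores = corpus_emb @ query_emb            # inner-product relevance scores, shape (num_docs,)
    return torch.topk(scores, k).indices       # indices of the k best-matching documents
```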
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - Dual Semantic Knowledge Composed Multimodal Dialog Systems [114.52730430047589]
We propose a novel multimodal task-oriented dialog system named MDS-S2.
It acquires context-related attribute and relation knowledge from the knowledge base.
We also devise a set of latent query variables to distill the semantic information from the composed response representation.
arXiv Detail & Related papers (2023-05-17T06:33:26Z) - FiTs: Fine-grained Two-stage Training for Knowledge-aware Question Answering [47.495991137191425]
We propose a Fine-grained Two-stage training framework (FiTs) to boost the KAQA system performance.
The first stage aims at aligning representations from the PLM and the KG, thus bridging the modality gaps between them.
The second stage, called knowledge-aware fine-tuning, aims to improve the model's joint reasoning ability.
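One common way to realize the representation alignment described in the first stage is a symmetric contrastive loss between paired text-encoder and KG-encoder embeddings. The sketch below is an assumption for illustration and may differ from the alignment objective FiTs actually uses.

```python
# Illustrative InfoNCE-style alignment loss between PLM (text) and KG representations
# of the same entities; not necessarily the loss used in FiTs.
import torch
import torch.nn.functional as F

def alignment_loss(text_emb: torch.Tensor,
                   kg_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """text_emb, kg_emb: (batch, dim) paired representations of the same entities."""
    text_emb = F.normalize(text_emb, dim=-1)
    kg_emb = F.normalize(kg_emb, dim=-1)
    logits = text_emb @ kg_emb.T / temperature                   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each text vector should match its own KG vector, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```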
arXiv Detail & Related papers (2023-02-23T06:25:51Z) - Retrieval Augmented Visual Question Answering with Outside Knowledge [14.371342370460685]
Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQA task that requires retrieval of external knowledge to answer questions about images.
Recent OK-VQA systems use Dense Passage Retrieval (DPR) to retrieve documents from external knowledge bases, such as Wikipedia, but with DPR trained separately from answer generation.
We propose a joint training scheme which includes differentiable DPR integrated with answer generation so that the system can be trained in an end-to-end fashion.
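A common form for this kind of end-to-end training is to marginalize the answer likelihood over the top-k retrieved documents so that gradients reach both the retriever and the generator. The expression below is the standard RAG-style objective, shown as a hedged approximation of what joint DPR-plus-generation training optimizes rather than the paper's exact loss.

```latex
% Standard RAG-style marginal likelihood (illustrative, not the paper's exact objective):
% q = question, v = image, a = answer, z ranges over the top-k retrieved documents.
\mathcal{L}(\theta, \eta) \;=\; -\log \sum_{z \in \operatorname{top-}k(q, v)}
    p_{\eta}(z \mid q, v)\; p_{\theta}(a \mid q, v, z)
```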
arXiv Detail & Related papers (2022-10-07T20:35:58Z) - Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
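As background for the compression-redundancy tradeoff mentioned above, the classical information bottleneck objective is shown below. This is only the textbook form, given for orientation; CIB refines the idea for multimodal inputs, and the paper's derived bound is not reproduced here.

```latex
% Classical information bottleneck (textbook form, for orientation only):
% compress the input X into a representation Z while preserving information
% about the prediction target Y, with beta controlling the tradeoff.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta\, I(Z; Y)
```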
arXiv Detail & Related papers (2022-09-14T22:04:10Z) - KAT: A Knowledge Augmented Transformer for Vision-and-Language [56.716531169609915]
We propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA.
Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation.
Our analysis shows that explicit knowledge integration also improves the interpretability of model predictions.
arXiv Detail & Related papers (2021-12-16T04:37:10Z)